SEED-SET • ICLR 2026

ICLR 2026 | International Conference on Learning Representations

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

A Bayesian experimental design framework for discovering ethically informative test scenarios when objectives conflict, preferences are subjective, and evaluation budgets are small.

Bayesian experimental design • Preference learning • Multi-objective testing • LLM-assisted evaluation
Anjali Parashar¹*, Yingke Li¹, Eric Yang Yu¹, Fei Chen¹, James Neidhoefer¹, Devesh Upadhyay², Chuchu Fan¹
¹ LIDS, MIT  •  ² Saab  •  * Corresponding author
SEED-SET overview figure

SEED-SET maintains separate models for objective outcomes and stakeholder preferences, then selects the next tests that are most informative about both.

5-minute summary

What SEED-SET does

Autonomous systems are increasingly deployed in settings where failures are not just technical but ethical. SEED-SET turns ethical benchmarking into an active test design problem: a sample-efficient strategy for learning which tests are most informative about both system behaviour and stakeholder values.

Problem

No single ethical metric

Real systems trade off multiple objectives such as fairness, cost, priority, and resilience. Improving one can hurt another.

Challenge

Stakeholder preferences differ

A regulator, a grid engineer, and a community advocate can prefer very different outcomes, and those preferences are hard to write down analytically.

Approach

Learn what to test next

SEED-SET models objective outcomes and pairwise stakeholder preferences separately, then actively proposes the next most useful tests.

Why this problem is hard

Ethical benchmarking is difficult for three reasons

01

No universal objective

In the power-grid example, we care about voltage fairness, deployment cost, priority coverage, and resilience. Different stakeholders care about different objectives, so there is no single scalar score that captures “ethical” behaviour for all users.

02

Subjective stakeholder values

Different stakeholders also care about different trade-offs. Those preferences are often qualitative, so SEED-SET learns them from pairwise comparisons.
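
One common way to turn pairwise comparisons into a learnable signal is a probit likelihood over a latent utility. The form below is a standard choice in preference learning and is shown only as an assumption; the symbols u, y, and σ are illustrative and are not taken from the paper.

```latex
% Assumed notation (not from the paper):
%   y(x)   : vector of objective outcomes for scenario x
%   u(.)   : latent stakeholder utility over outcomes
%   \sigma : scale of the noise in each utility judgment
%   \Phi   : standard normal CDF
P\bigl(x_i \succ x_j\bigr)
  = \Phi\!\left(\frac{u\bigl(y(x_i)\bigr) - u\bigl(y(x_j)\bigr)}{\sqrt{2}\,\sigma}\right)
```

Under this model, a stakeholder who answers "I prefer scenario i to scenario j" contributes information about the difference in utility only, which is why no analytic scoring formula has to be written down in advance.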

03

Massive scenario spaces

Even moderate system models can induce high-dimensional spaces of possible tests. Exhaustive evaluation is infeasible, which makes active test selection essential.
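
To make the scale concrete, here is a rough calculation with made-up numbers (k and d are hypothetical, not values from the paper): discretising each of d test parameters into k levels gives a full grid of k^d scenarios.

```latex
% Illustrative numbers only.
N = k^{d}, \qquad k = 10,\ d = 8 \;\Rightarrow\; N = 10^{8}
```

At one second per evaluation, 10^8 scenarios would take a little over three years to run, so a budget of tens or hundreds of tests can touch only a vanishing fraction of the grid.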

Core idea

Treat ethical testing as a Bayesian experimental design problem: maintain uncertainty over both what the system does and what the stakeholder prefers, then pick the next scenario that reveals the most about both.
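
The loop structure behind this idea can be sketched in a few lines. The code below is a minimal stand-in, not SEED-SET itself: it covers only the "what the system does" half, uses a single Gaussian Process with an RBF kernel, and scores candidates with a standard upper-confidence-bound rule. The function `toy_system`, the kernel length scale, and `beta` are all assumptions made for illustration.

```python
import numpy as np

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_cand, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at the candidate points."""
    if len(x_obs) == 0:
        # No data yet: fall back to the prior (mean 0, variance 1).
        return np.zeros(len(x_cand)), np.ones(len(x_cand))
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_cand, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.clip(var, 0.0, None)

def toy_system(x):
    """Hypothetical stand-in for one expensive system evaluation."""
    return np.sin(3.0 * x) * (1.0 - x)

def run_design_loop(candidates, budget, beta=2.0):
    """Repeatedly pick the candidate with the highest mean + beta * std."""
    xs, ys = [], []
    for _ in range(budget):
        mean, var = gp_posterior(np.array(xs), np.array(ys), candidates)
        score = mean + beta * np.sqrt(var)  # uncertain or promising points score high
        x_next = float(candidates[int(np.argmax(score))])
        xs.append(x_next)
        ys.append(float(toy_system(x_next)))  # spend one unit of evaluation budget
    return xs, ys

candidates = np.linspace(0.0, 1.0, 101)
xs, ys = run_design_loop(candidates, budget=15)
print("tests chosen:", xs)
print("best observed outcome:", max(ys))
```

In a two-model setup like the one described here, the score would also account for uncertainty in a learned preference model, so that each selected test is informative about stakeholder values as well as system behaviour.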

Interactive pipeline

A compact view of the full loop


Test parameters

A candidate scenario x encodes a proposed test. In the OPF setting, x describes DER placement and reactive power settings. SEED-SET scores many such candidates and decides which ones are worth evaluating next.
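
As a concrete picture of what x might look like, here is one possible encoding. The class name, field names, and units are hypothetical; the paper's actual parameterisation may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Scenario:
    """Hypothetical encoding of one OPF test candidate x."""
    der_buses: Tuple[int, ...]         # bus indices where DERs are placed
    reactive_power: Tuple[float, ...]  # one reactive power setting per DER (units assumed)

    def __post_init__(self):
        # Each placed DER needs exactly one reactive power setting.
        if len(self.der_buses) != len(self.reactive_power):
            raise ValueError("need one reactive power setting per DER")

    def to_vector(self) -> List[float]:
        """Flatten to the numeric vector a surrogate model would consume."""
        return [float(b) for b in self.der_buses] + list(self.reactive_power)

x = Scenario(der_buses=(3, 7), reactive_power=(0.2, -0.1))
print(x.to_vector())
```

Freezing the dataclass keeps each candidate immutable and hashable, which is convenient when many candidates are scored and deduplicated before evaluation.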

Key takeaways

What the paper shows

Candidate quality

Up to 2× more optimal test candidates found than baselines under the same evaluation budget.

Coverage
1.25×

Better coverage of high-dimensional search spaces, helping the test suite reveal more diverse failure modes.

Interpretability
Two-model

Separate models for outcomes and preferences make it clearer whether uncertainty comes from system behaviour or stakeholder values.

Different criteria collect different kinds of scenarios

Consider two ethical criteria. Criteria-1: low cost (most important), high fairness, high resilience, high priority. Criteria-2: high priority (most important), high fairness, high resilience, low cost (less important).

Comparison figure showing that collected scenarios under different criteria lead to different observed priority and cost trade-offs
Criteria-1 scenarios show lower average cost than Criteria-2 scenarios. They also show lower average priority, because cost and priority are coupled and Criteria-1 strongly favors low cost. Our approach successfully generates scenarios that align with complex, hidden ethical criteria learned from pairwise evaluations.
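
To see how two hidden criteria can disagree about the same pair of scenarios, consider the toy comparison oracle below. The linear weights, the outcome ordering, and the example outcome values are all invented for illustration; the paper does not state that stakeholder criteria are linear.

```python
import numpy as np

# Assumed outcome order: [cost, fairness, resilience, priority], each scaled to [0, 1].
# Cost is to be minimised, so its weight is negative; the others are maximised.
CRITERIA_1 = np.array([-4.0, 1.0, 1.0, 1.0])  # low cost dominates
CRITERIA_2 = np.array([-0.5, 1.0, 1.0, 4.0])  # high priority dominates, cost matters least

def prefers_first(outcome_a, outcome_b, weights):
    """Return True if a stakeholder with these hidden weights prefers outcome_a."""
    score_a = float(weights @ np.asarray(outcome_a))
    score_b = float(weights @ np.asarray(outcome_b))
    return score_a > score_b

# Two hypothetical scenarios that differ only in cost and priority coverage.
cheap_low_priority = [0.2, 0.6, 0.6, 0.3]
costly_high_priority = [0.8, 0.6, 0.6, 0.9]

print(prefers_first(cheap_low_priority, costly_high_priority, CRITERIA_1))
print(prefers_first(cheap_low_priority, costly_high_priority, CRITERIA_2))
```

The tester only ever sees the boolean answers, never the weights, which is the sense in which the criteria are hidden and must be learned from pairwise evaluations.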

Different stakeholders produce different test suites


For example, for a stakeholder focused on priority and access, SEED-SET tends to surface scenarios where underserved buses are ignored or poorly covered, because these tests are the most informative about that stakeholder's concerns.

Resources

Where to go next

Abstract

As autonomous systems such as drones become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate ethical alignment since failure to do so can impose imminent danger to human lives and long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined metrics for evaluation, and stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes and uses a novel acquisition strategy to propose interesting test candidates based on learned qualitative preferences and objectives that align with stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find our method to perform the best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, generating up to 2× optimal test candidates compared to baselines, with 1.25× improvement in coverage of high-dimensional search spaces.

BibTeX

@inproceedings{parasharseed,
  title={SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing},
  author={Parashar, Anjali and Li, Yingke and Yu, Eric Yang and Chen, Fei and Neidhoefer, James and Upadhyay, Devesh and Fan, Chuchu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

Authors & Affiliations

Anjali Parashar
MIT REALM Lab