How do you test if an autonomous system is ethical?

When an AI system decides where to install distributed energy resources across a city's power grid, it is making choices that affect who gets reliable electricity, who pays more, and which neighbourhoods are protected against blackouts. SEED-SET finds the test scenarios that reveal these behaviours. We explain the key parts of our pipeline using a power grid resource management example.

Bayesian Experimental Design · Preference learning · Multi-objective optimisation · LLM-as-evaluator
01 — the problem

Why is ethical benchmarking hard?

Consider an autonomous agent that allocates Distributed Energy Resources (DERs) — solar inverters, batteries, reactive power compensators — across a power network. Before this system is deployed, we want test scenarios that reveal whether its allocation decisions are ethically sound.

This is challenging for three reasons.

1. No universal metric

Ethical behaviour in grid management is multi-dimensional. We care about voltage fairness: are all households receiving stable voltage, or do some neighbourhoods get worse service? We care about cost: how much does the DER deployment cost overall? We care about priority: are underserved or rural buses getting the coverage they need? And we care about resilience: can the network withstand a fault without buses dropping below safe voltage thresholds? These four objectives are in fundamental tension: placing more DERs improves priority but always increases cost.

2. Stakeholder subjectivity

A utility regulator, a community advocate, and a grid engineer weight those same four objectives very differently. A community advocate cares most about voltage fairness for low-income areas; a grid engineer cares most about fault resilience; a budget office cares most about cost. There is no single ground truth, and these preferences cannot be written down analytically. They must be elicited from each stakeholder.

3. Enormous test spaces

A 30-bus IEEE network has a 40-dimensional parameter space: 20 binary DER placement decisions plus 20 reactive power limit values. Evaluating every combination is impossible. We need to be smart about which test scenarios are worth running.
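A quick count shows why exhaustive evaluation is out of reach. The sketch below uses the 20 + 20 split stated above; the 5-level discretisation of each reactive power limit is an assumption made purely for illustration, since no discretisation is specified here.

```python
# Rough size of the search space for the 30-bus example.
n_placements = 20   # binary DER placement decisions
n_limits = 20       # reactive power limit values

# Every on/off combination of the 20 placement decisions.
placement_configs = 2 ** n_placements
print(placement_configs)  # 1048576

# Assumption (illustration only): each reactive power limit takes 5 levels.
levels_per_limit = 5
total_configs = placement_configs * levels_per_limit ** n_limits
print(total_configs)      # 100000000000000000000, i.e. 10**20
```

Even with this coarse grid, one simulator call per configuration is far beyond any realistic test budget.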

SEED-SET: treat test-scenario selection as a Bayesian experimental design problem. Maintain a probabilistic model of both the system's objective outcomes (cost, voltage fairness, and so on) and the stakeholder's preferences over those outcomes, and select the next test by maximising information gain about both simultaneously.
02 — the pipeline

A two-model loop

SEED-SET maintains two coupled surrogate models, updated after every test evaluation. Click each node below to see what it does.

[Interactive pipeline diagram. Nodes: Test params x (D-dim vector); Outcome model (Variational GP, x → y); Outcomes y (M objectives); LLM evaluator (pairwise label z); Preference model (Pairwise VGP, y → z); Active Inference data acquisition (select next x).]

The loop runs for T iterations. At each step, our data acquisition strategy, called Active Inference (AIF), proposes a pair of test candidates (x1, x2). Both are evaluated against the grid simulator, their outcomes are presented to the LLM stakeholder proxy, which returns a pairwise label, and both models are updated. The result is a test suite that progressively concentrates on the scenarios a specific stakeholder cares about most.
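The control flow of this loop can be sketched as follows. Every component here is a hypothetical stand-in, not the actual implementation: `propose_pair` draws random candidates in place of the AIF criterion, `simulate` is a toy function in place of the grid simulator, and `stakeholder_label` is a fixed weighted rule in place of the LLM evaluator.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, T = 10, 4, 5  # parameter dimension, number of objectives, iterations

def propose_pair():
    """Stand-in for the acquisition step: two random candidates in [0, 1]^D."""
    return rng.uniform(size=D), rng.uniform(size=D)

def simulate(x):
    """Stand-in for the grid simulator: maps x to M toy objective values."""
    density = x[:5].mean()
    return np.array([1.0 - x[:5].std(), density, density, x[5:].mean()])

def stakeholder_label(y1, y2, w=np.array([0.4, -0.1, 0.3, 0.2])):
    """Stand-in for the LLM evaluator: returns 1 or 2 for the preferred outcome."""
    return 1 if w @ y1 >= w @ y2 else 2

outcome_data = []     # (x, y) pairs that would train the outcome model
preference_data = []  # (y1, y2, z) triples that would train the preference model

for t in range(T):
    x1, x2 = propose_pair()                # select the next pair of tests
    y1, y2 = simulate(x1), simulate(x2)    # evaluate both candidates
    z = stakeholder_label(y1, y2)          # one pairwise query per iteration
    outcome_data += [(x1, y1), (x2, y2)]
    preference_data.append((y1, y2, z))
    # A full implementation would refit both surrogate models here.

print(len(outcome_data), len(preference_data))  # 10 5
```

Each iteration adds two outcome observations but only one preference label, which is why the pair is proposed jointly.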

03 — the outcome model

What the VGP learns about the grid

The outcome model is an independent multitask Variational GP — one VGP per objective. Given a test parameter vector x, it returns a posterior distribution p(y|x) = N(μ(x), diag(σ(x)²)) over the four grid objectives.
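To make "one GP per objective" concrete, here is a minimal NumPy sketch. It uses exact GP regression with a unit-variance RBF kernel rather than the variational approximation named above, and the kernel, lengthscales, and noise level are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    """Squared-exponential kernel between rows of a (n, D) and b (m, D)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def independent_gp_posterior(X, Y, x_new, lengthscales, noise=1e-3):
    """Fit one exact GP per objective; return mu(x_new) and sigma(x_new)."""
    n, M = Y.shape
    mu, sigma = np.zeros(M), np.zeros(M)
    for j in range(M):  # objectives are modelled independently
        K = rbf_kernel(X, X, lengthscales[j]) + noise * np.eye(n)
        k_star = rbf_kernel(X, x_new[None, :], lengthscales[j])[:, 0]
        mu[j] = k_star @ np.linalg.solve(K, Y[:, j])
        var = 1.0 - k_star @ np.linalg.solve(K, k_star)
        sigma[j] = np.sqrt(max(var, 0.0))  # clip tiny negative round-off
    return mu, sigma

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 10))  # 20 evaluated test vectors, D = 10
Y = rng.normal(size=(20, 4))    # placeholder values for the four objectives
scales = [1.0, 1.0, 1.5, 2.0]   # hypothetical per-objective lengthscales

mu_seen, sigma_seen = independent_gp_posterior(X, Y, X[0], scales)
mu_far, sigma_far = independent_gp_posterior(X, Y, np.full(10, 5.0), scales)
print(sigma_seen)  # small: x was already evaluated
print(sigma_far)   # close to 1: far from all data, so the prior dominates
```

The diagonal covariance diag(σ(x)²) follows directly from treating the objectives as independent: each objective contributes its own mean and its own standard deviation.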

For the IEEE 5-bus OPF task, x has 10 dimensions: 5 DER placement indicators (x[0,...,4]) and 5 reactive power limits Q (x[5,...,9]). The widget below uses the analytical ground-truth posterior — what the VGP would learn given sufficient data. Drag the sliders to see how the four objectives respond.

[Interactive widget: VGP posterior explorer, IEEE 5-bus OPF. Sliders adjust the test parameters; panels show the posterior mean μ(x), the posterior std σ(x) (uncertainty), and how the objectives respond to DER density (μ ± σ).]

Each line shows μ(x) as DER density varies 0→1 (other sliders fixed). Shaded band = ±σ. The dashed vertical line marks the current slider position.

Notice that Priority rises steeply with DER density: more DERs mean more buses covered. But Cost rises with it too, so the two compete for the same budget. Fairness peaks at a moderate, even DER spread (low variance across buses) rather than at maximum density. These trade-offs are exactly what SEED-SET's test suite is designed to surface.

In a full run, the VGP learns these relationships from evaluations of the real grid simulator (pandapower.runpp). In this blog post we use the analytical ground truth so you can explore the landscape without waiting for training.
04 — the preference model

Asking "which scenario is better?"

SEED-SET encodes stakeholder values through pairwise comparisons. The LLM receives a prompt describing two DER placement outcomes and the stakeholder's criteria, and responds with 1 or 2. No explicit utility function is required, just a preference between two concrete outcomes.
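A pairwise query of this kind might be assembled and parsed as in the sketch below. The prompt wording, the `build_prompt` and `parse_label` helpers, and the example criteria are all hypothetical; they are not the prompt used in the paper.

```python
OBJECTIVES = ("voltage fairness", "cost", "priority", "resilience")

def build_prompt(criteria, y1, y2):
    """Format two outcome vectors and the stakeholder criteria as one query."""
    def describe(y):
        return ", ".join(f"{name} = {value:.2f}" for name, value in zip(OBJECTIVES, y))
    return (
        f"Stakeholder criteria: {criteria}\n"
        f"Outcome 1: {describe(y1)}\n"
        f"Outcome 2: {describe(y2)}\n"
        "Which outcome better satisfies the criteria? Answer with 1 or 2."
    )

def parse_label(response):
    """Turn the evaluator's reply into a pairwise label, rejecting anything else."""
    answer = response.strip()
    if answer not in ("1", "2"):
        raise ValueError(f"expected '1' or '2', got {response!r}")
    return int(answer)

prompt = build_prompt(
    "Prioritise coverage of underserved buses.",  # hypothetical criteria text
    [0.62, 0.40, 0.85, 0.55],
    [0.70, 0.25, 0.50, 0.60],
)
print(prompt)
print(parse_label(" 2\n"))  # 2
```

Rejecting malformed replies explicitly keeps a stray answer from being recorded as a preference label.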

P(y1 ≻ y2) = Φ( f̂(y1) − f̂(y2) )
Probit likelihood of preferring y₁ over y₂, where f̂ is the latent preference function modelled by PairwiseGP
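The probit likelihood is easy to evaluate directly. The sketch below uses only the standard library; the latent values passed in are made-up numbers, standing in for what the preference model would return.

```python
import math

def std_normal_cdf(t):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def preference_probability(f1, f2):
    """P(y1 preferred over y2) given latent utilities f1 = f(y1), f2 = f(y2)."""
    return std_normal_cdf(f1 - f2)

print(preference_probability(1.0, 1.0))            # 0.5
print(round(preference_probability(2.0, 1.0), 4))  # 0.8413
```

Only the difference between the two latent values matters, so equal utilities give a coin flip and the probability of preferring y2 is one minus the probability of preferring y1.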

Below you can build two outcome vectors and see the exact prompt the LLM receives. Pick a stakeholder — notice how the same numerical outcomes lead to different preferences depending on whose criteria are embedded.

[Interactive widget: pairwise preference elicitation. Select a stakeholder, set Outcome 1 and Outcome 2 with the sliders, and view the LLM prompt the model sees.]
Why LLMs instead of humans? Real stakeholders, such as utility commissioners, community representatives, and engineers, can also be involved. LLMs serve as scalable proxies: they can be queried much faster than humans, do not suffer decision fatigue, and are not prone to the self-inconsistent reporting that affects human respondents. The framework is not only model-agnostic but also evaluator-agnostic: swap the LLM for a human survey interface and the rest of the pipeline is unchanged.
05 — acquisition function

What to test next: the Active Inference criterion

At each iteration, the Active Inference (AIF) criterion selects the pair of test scenarios that will be most informative. It scores every candidate x by summing three terms:

AIF(x) = I(x; y)  +  I(y; z)  +  𝔼[z]
I(x; y) — outcome mutual information  ·  I(y; z) — preference mutual information  ·  𝔼[z] — expected preference (exploit)

1 Outcome MI  I(x; y) = ½ ∑ⱼ log(1 + σⱼ(x)² / η²): highest in regions where the objective VGP is uncertain. Drives exploration of the objective landscape.

2 Preference MI  I(y; z): highest near the boundary between preferred and non-preferred outcomes, where the Pairwise VGP is most uncertain about the stakeholder's latent utility. Drives learning of stakeholder values.

3 Expected preference  𝔼[z] = wᵀ μ(x): the stakeholder's expected score at x under the current model. Pure exploitation toward high-value regions.
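The three terms above can be combined as in the following sketch. The outcome MI and expected preference follow the formulas as written; the preference MI has no closed form given here, so it is passed in as a precomputed number (an assumption of this sketch), and all numeric inputs are invented for illustration.

```python
import numpy as np

def outcome_mi(sigma, eta):
    """I(x; y) = 1/2 * sum_j log(1 + sigma_j(x)^2 / eta^2)."""
    return 0.5 * float(np.sum(np.log1p(sigma**2 / eta**2)))

def expected_preference(w, mu):
    """E[z] = w^T mu(x)."""
    return float(w @ mu)

def aif_score(mu, sigma, w, eta, preference_mi):
    """Sum of the three terms; preference_mi is supplied by the preference model."""
    return outcome_mi(sigma, eta) + preference_mi + expected_preference(w, mu)

mu = np.array([0.6, 0.4, 0.8, 0.5])     # hypothetical posterior means
sigma = np.full(4, 0.1)                 # hypothetical posterior stds
w = np.array([0.25, 0.25, 0.25, 0.25])  # hypothetical preference weights
eta = 0.1                               # hypothetical value for eta

# With sigma_j equal to eta, each objective contributes 1/2 * log(2).
print(outcome_mi(sigma, eta))           # 2 * log(2), about 1.386
print(aif_score(mu, sigma, w, eta, preference_mi=0.3))
```

The outcome MI shrinks to zero as σ(x) does, so once a region is well modelled the score there is driven by the preference terms.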

The top-2 pair is selected jointly using BoTorch's optimize_acqf, so one pairwise LLM query covers both candidates per iteration. Watch below as the algorithm progressively finds higher-preference DER configurations.

[Interactive widget: AIF acquisition live run, runs in the browser.]
06 — stakeholder diversity

Different stakeholders, different test suites

A key result in the paper is that SEED-SET discovers qualitatively different test scenarios depending on whose preferences are embedded. A priority advocate's test suite concentrates on DER configurations that maximise coverage of underserved buses, revealing whether the AI neglects rural areas. A grid engineer's test suite concentrates on fault-resilience edge cases, revealing whether the AI leaves the network vulnerable to voltage collapse.

Both run the same algorithm. The only difference is the pairwise labels used to train the Pairwise VGP. Run the comparison below and watch the two scatter plots diverge. In particular, compare the Priority and Cost values of the scenarios SEED-SET designs for the priority advocate with those it designs for the other stakeholder.

[Interactive widget: side-by-side stakeholder comparison, runs in the browser.]
07 — paper results

How does SEED-SET compare to baselines?

The paper evaluates SEED-SET against four baselines (random search, single-GP BO, SVM-based preference learning, and BOPE) across OPF-5, OPF-30, SAR, and Transit tasks.

Candidate quality

SEED-SET finds up to 2× more optimal test candidates than baselines in the same budget. "Optimal" means within the top decile of the ground-truth preference score.
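The "top decile" criterion can be sketched as a simple count. The choice of reference pool used to set the threshold is an assumption here, as is the `count_optimal` helper; the scores are synthetic.

```python
import numpy as np

def count_optimal(selected_scores, reference_scores):
    """Count selected candidates whose score reaches the reference 90th percentile."""
    threshold = np.quantile(reference_scores, 0.9)
    return int(np.sum(np.asarray(selected_scores) >= threshold))

reference = np.arange(100)   # synthetic ground-truth preference scores, 0..99
selected = [95, 50, 90, 89]  # synthetic scores of the tests a method picked

print(np.quantile(reference, 0.9))         # 89.1
print(count_optimal(selected, reference))  # 2
```

Comparing this count across methods at the same evaluation budget gives the kind of ratio quoted above.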

Coverage

SEED-SET achieves 1.25× better coverage of high-dimensional objective space compared to baselines. Coverage matters because a good test suite should map diverse behaviours, not cluster around one mode.

Adapts to specific criteria

When run with different stakeholder criteria on the same grid, SEED-SET's test suites are measurably non-overlapping. The priority advocate's suite stresses underserved-bus scenarios, favouring those that score high on priority. A market operator cares about low cost, and the corresponding suite favours low-cost scenarios. Our results generalise to multi-stakeholder criteria as well.

What the paper does not do: SEED-SET does not decide whether a system is ethical. It finds the test scenarios most likely to reveal ethical failures according to a specific stakeholder's values. The judgement itself remains with humans.
cite

Citation

@inproceedings{parashar2026seedset,
  title     = {{SEED-SET}: Scalable Evolving Experimental Design for System-level Ethical Testing},
  author    = {Parashar, Anjali and Li, Yingke and Yu, Eric Yang and Chen, Fei and Neidhoefer, James and Upadhyay, Devesh and Fan, Chuchu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.01630}
}