When an AI system decides where to install distributed energy resources across a city's power grid, it is making choices that affect who gets reliable electricity, who pays more, and which neighbourhoods are protected against blackouts. SEED-SET finds the test scenarios that reveal these behaviours. We explain the key parts of our pipeline using a power grid resource management example.
Consider an autonomous agent that allocates Distributed Energy Resources (DERs) — solar inverters, batteries, reactive power compensators — across a power network. Before this system is deployed, we want test scenarios that reveal whether its allocation decisions are ethically sound.
This is challenging for three reasons.
Ethical behaviour in grid management is multi-dimensional. We care about voltage fairness: are all households receiving stable voltage, or do some neighbourhoods get worse service? We care about cost: how much does the DER deployment cost overall? We care about priority: are underserved or rural buses getting the coverage they need? And we care about resilience: can the network withstand a fault without buses dropping below safe voltage thresholds? These four objectives are in fundamental tension: placing more DERs improves priority but always increases cost.
A utility regulator, a community advocate, and a grid engineer weight those same four objectives very differently. A community advocate cares most about voltage fairness for low-income areas; a grid engineer cares most about fault resilience; a budget office cares most about cost. There is no single ground truth, and these preferences cannot be written down analytically. They must be elicited from each stakeholder.
A 30-bus IEEE network has a 40-dimensional parameter space: 20 binary DER placement decisions plus 20 reactive power limit values. Evaluating every combination is impossible. We need to be smart about which test scenarios are worth running.
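To get a feel for the scale, here is a rough back-of-the-envelope count. The 20 binary and 20 continuous dimensions come from the text; the 10-level grid for each reactive power limit is purely our assumption for illustration, since the real limits are continuous.

```python
# Rough size of the 40-dimensional search space described above.
n_binary = 20       # binary DER placement decisions (from the text)
n_continuous = 20   # reactive power limit values (from the text)
levels = 10         # ASSUMPTION: coarse 10-level grid per continuous limit

placements = 2 ** n_binary             # distinct on/off placement patterns
grid_points = levels ** n_continuous   # coarse settings of the limits
total = placements * grid_points

print(f"placement patterns alone: {placements:,}")
print(f"combinations on the coarse grid: {total:.3e}")
```

Even the placement patterns alone number over a million, and each evaluation requires a simulator run, which is why exhaustive testing is off the table.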
SEED-SET maintains two coupled surrogate models, updated after every test evaluation. Click each node below to see what it does.
The loop runs for T iterations. At each step, our data acquisition strategy, called Active Inference (AIF), proposes a pair of test candidates (x1, x2). Both are evaluated against the grid simulator, their outcomes are presented to the LLM stakeholder proxy, which returns a pairwise label, and both models are updated. The result is a test suite that progressively concentrates on the scenarios a specific stakeholder cares about most.
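The control flow of that loop can be sketched in a few lines. This is only a skeleton under our own assumptions: the function name `run_loop` and the four callbacks are hypothetical, and the toy stand-ins at the bottom replace the real acquisition strategy, grid simulator, LLM proxy, and model updates.

```python
import random

def run_loop(T, propose_pair, simulate, ask_stakeholder, update_models):
    """Run T iterations of propose -> simulate -> label -> update."""
    history = []
    for _ in range(T):
        x1, x2 = propose_pair(history)      # acquisition step (AIF in the text)
        y1, y2 = simulate(x1), simulate(x2) # evaluate both candidates
        label = ask_stakeholder(y1, y2)     # pairwise label: 1 or 2
        history.append((x1, x2, y1, y2, label))
        update_models(history)              # refit both surrogate models
    return history

# Toy stand-ins (ASSUMPTIONS, not the real components).
random.seed(0)

def random_candidate():
    return [random.random() for _ in range(10)]

history = run_loop(
    T=5,
    propose_pair=lambda h: (random_candidate(), random_candidate()),
    simulate=lambda x: sum(x),                            # fake scalar outcome
    ask_stakeholder=lambda y1, y2: 1 if y1 >= y2 else 2,  # fake preference
    update_models=lambda h: None,                         # no-op
)
print(len(history), [row[-1] for row in history])
```

Passing the components in as callbacks keeps the loop itself independent of which simulator or stakeholder proxy is plugged in.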
The outcome model is an independent multitask Variational GP — one VGP per objective. Given a test parameter vector x, it returns a posterior distribution p(y|x) = N(μ(x), diag(σ(x)²)) over the four grid objectives.
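Because the objectives are modelled independently, the posterior covariance is diagonal, so sampling outcomes reduces to scaling standard normal draws per objective. The sketch below uses plain NumPy rather than a trained VGP; the means, standard deviations, and the ordering of the objective names are hypothetical values chosen for illustration.

```python
import numpy as np

# ASSUMPTION: ordering of the four objectives is ours, for illustration only.
OBJECTIVES = ["fairness", "cost", "priority", "resilience"]

def sample_outcomes(mu, sigma, n_samples, rng):
    """Draw samples from N(mu, diag(sigma**2)), one column per objective."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    noise = rng.standard_normal((n_samples, mu.size))
    return mu + sigma * noise

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.4, 0.7, 0.5])         # hypothetical posterior means
sigma = np.array([0.05, 0.10, 0.02, 0.20])  # hypothetical posterior std devs

samples = sample_outcomes(mu, sigma, 10_000, rng)
for name, m, s in zip(OBJECTIVES, samples.mean(axis=0), samples.std(axis=0)):
    print(f"{name:>10}: mean={m:.3f} std={s:.3f}")
```

A large σ for one objective signals that the model still knows little about that objective at x.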
For the IEEE 5-bus OPF task, x has 10 dimensions: 5 DER placement indicators (x[0,...,4]) and 5 reactive power limits Q (x[5,...,9]). The widget below uses the analytical ground-truth posterior — what the VGP would learn given sufficient data. Drag the sliders to see how the four objectives respond.
Notice that Priority rises steeply with DER density. More DERs mean more buses covered. But Cost rises with it too, and they are both competing for the same budget. Fairness peaks at moderate, even DER spread (low variance across buses) rather than at maximum density. These trade-offs are exactly what SEED-SET's test suite is designed to surface.
pandapower.runpp). In this blog post we use the analytical ground truth so you can explore the landscape without waiting for training.
SEED-SET encodes stakeholder values through pairwise comparisons. The LLM receives a prompt describing two DER placement outcomes and the stakeholder's criteria, and responds with 1 or 2. No explicit utility function is required, just a preference between two concrete outcomes.
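A pairwise query of this kind might be assembled and parsed as follows. The prompt wording, the function names `build_prompt` and `parse_label`, and the example numbers are our own illustration, not the exact prompt used in the paper, and the LLM call itself is left out.

```python
import re

def build_prompt(criteria, outcome_1, outcome_2):
    """Format two outcome dicts and the stakeholder criteria as one prompt."""
    def fmt(outcome):
        return ", ".join(f"{k}={v:.2f}" for k, v in outcome.items())
    return (
        f"Stakeholder criteria: {criteria}\n"
        f"Outcome 1: {fmt(outcome_1)}\n"
        f"Outcome 2: {fmt(outcome_2)}\n"
        "Which outcome better matches the criteria? Answer with 1 or 2."
    )

def parse_label(reply):
    """Return the first standalone 1 or 2 in the reply as an int."""
    match = re.search(r"\b([12])\b", reply)
    if match is None:
        raise ValueError(f"no pairwise label found in reply: {reply!r}")
    return int(match.group(1))

# Hypothetical outcome vectors for illustration.
a = {"fairness": 0.80, "cost": 0.60, "priority": 0.70, "resilience": 0.50}
b = {"fairness": 0.55, "cost": 0.30, "priority": 0.40, "resilience": 0.65}
prompt = build_prompt("prioritise coverage of underserved buses", a, b)
print(prompt)
print(parse_label("2"))
```

Rejecting replies without a clear 1 or 2 avoids silently training the preference model on a guessed label.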
Below you can build two outcome vectors and see the exact prompt the LLM receives. Pick a stakeholder — notice how the same numerical outcomes lead to different preferences depending on whose criteria are embedded.
At each iteration, Active Inference Function (AIF) selects the pair of test scenarios that will be most informative. It scores every candidate x by summing three terms:
1. Outcome MI, I(x; y) = ½ ∑j log(1 + σj(x)² / η²): highest in regions where the objective VGP is uncertain. Drives exploration of the objective landscape.
2. Preference MI, I(y; z): highest near the boundary between preferred and non-preferred outcomes, where the Pairwise VGP is most uncertain about the stakeholder's latent utility. Drives learning of stakeholder values.
3. Expected preference, 𝔼[z] = wᵀ μ(x): the stakeholder's expected score at x under the current model. Pure exploitation toward high-value regions.
The top-2 pair is selected jointly using optimize_acqf from the BoTorch library, so a single pairwise LLM query covers both candidates in each iteration.
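To make the three-term score concrete, here is a minimal NumPy sketch over a small finite candidate pool. It deliberately swaps the joint optimize_acqf step for a simple sort over precomputed scores, and it takes the preference MI as a given array because that term depends on the Pairwise VGP; all numbers and function names are hypothetical.

```python
import numpy as np

def outcome_mi(sigma, eta):
    """I(x; y) = 0.5 * sum_j log(1 + sigma_j(x)^2 / eta^2), per candidate."""
    return 0.5 * np.log1p(sigma**2 / eta**2).sum(axis=-1)

def expected_preference(mu, w):
    """E[z] = w^T mu(x), per candidate."""
    return mu @ w

def top2(scores):
    """Indices of the two highest-scoring candidates (pool-based shortcut)."""
    order = np.argsort(scores)[::-1]
    return int(order[0]), int(order[1])

# Three hypothetical candidates, four objectives each.
sigma = np.array([[0.1] * 4, [0.5] * 4, [0.1] * 4])  # posterior std devs
mu = np.array([[0.2] * 4, [0.5] * 4, [0.9] * 4])     # posterior means
w = np.full(4, 0.25)                                 # ASSUMED stakeholder weights
eta = 0.1                                            # ASSUMED noise scale
pref_mi = np.array([0.0, 0.1, 0.3])                  # ASSUMED, from the Pairwise VGP

scores = outcome_mi(sigma, eta) + pref_mi + expected_preference(mu, w)
pair = top2(scores)
print(scores.round(3), pair)
```

In this toy pool the most uncertain candidate wins on outcome MI alone, while the runner-up is chosen for its high expected preference, which shows how the terms trade off exploration against exploitation.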
Watch below as the algorithm progressively finds higher-preference DER configurations.
A key result in the paper is that SEED-SET discovers qualitatively different test scenarios depending on whose preferences are embedded. A priority advocate's test suite concentrates on DER configurations that maximise coverage of underserved buses, revealing whether the AI neglects rural areas. A grid engineer's test suite concentrates on fault-resilience edge cases, revealing whether the AI leaves the network vulnerable to voltage collapse.
Both run the same algorithm. The only difference is the pairwise labels used to train the Pairwise VGP. Run the comparison below and watch the two scatter plots diverge. Specifically, note the difference in the Priority and Cost values of the scenarios SEED-SET designs for the priority advocate versus those it designs for the other stakeholder's criteria.
The paper evaluates SEED-SET against four baselines (random search, single-GP BO, SVM-based preference learning, and BOPE) across OPF-5, OPF-30, SAR, and Transit tasks.
SEED-SET finds up to 2× more optimal test candidates than baselines in the same budget. "Optimal" means within the top decile of the ground-truth preference score.
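One way to compute such a top-decile rate is sketched below. We assume the decile threshold is taken over a reference pool of ground-truth preference scores; the text does not specify the pool, so that choice, the function name, and the numbers are ours.

```python
import numpy as np

def optimal_fraction(suite_scores, reference_scores):
    """Fraction of suite scores at or above the 90th percentile of the pool."""
    threshold = np.quantile(reference_scores, 0.9)
    return float(np.mean(np.asarray(suite_scores) >= threshold))

# Hypothetical ground-truth preference scores.
reference = np.arange(100)             # pool of 100 scenarios scored 0..99
suite = np.array([95, 50, 90, 10])     # scores of a 4-scenario test suite

frac = optimal_fraction(suite, reference)
print(frac)  # 0.5: two of the four scenarios clear the threshold of 89.1
```

Comparing this fraction across methods under the same evaluation budget gives the kind of ratio quoted above.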
SEED-SET achieves 1.25× better coverage of high-dimensional objective space compared to baselines. Coverage matters because a good test suite should map diverse behaviours, not cluster around one mode.
When run with different stakeholder criteria on the same grid, SEED-SET's test suites are measurably non-overlapping. The priority advocate's suite stresses underserved-bus scenarios and concentrates on scenarios with high Priority values. A market operator cares about low cost, and the corresponding suite favors low-cost scenarios. Our results also generalize to multi-stakeholder criteria.