A Judge Agent Closes the Reliability Gap in AI-Generated Scientific Simulation

Chengshuai Yang

Abstract

Large language models can generate scientific simulation code, but the generated code silently fails on most non-textbook problems. We show that classical mathematical validation -- well-posedness, convergence, and error certification -- can be fully automated by a Judge Agent, reducing the silent-failure rate from 42% to 1.5% across 134 test cases spanning 12 scientific domains. The headline result comes from a prospective benchmark: 72 blinded tasks submitted by 12 independent scientists yield an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, versus 53% without the Judge. On clinical CT (the only powered experiment, n = 200), the pipeline reaches 99% of expert quality. The residual 1.5% concentrates at bifurcation points where certifiability breaks down. We formalize this boundary through the simulability class S and introduce spec.md, a structured specification format that makes any scientific computation problem machine-readable and solver-independent. Code, data, and all 72 benchmark tasks are publicly archived.

Paper Structure

This paper contains 31 sections, 4 theorems, 6 equations, 5 figures, 22 tables.

Key Result

Proposition 2

Every problem $\mathcal{P} \in \mathfrak{S}$ admits a representation as a valid spec.md file. Conversely, every valid spec.md file defines a problem in $\mathfrak{S}$.
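A minimal sketch of what such a file might look like, assuming illustrative section and field names (the paper's exact grammar is given in its Supplementary Section S8; only the six-section structure and the YAML-in-Markdown syntax are stated in the source):

```markdown
## Domain        <!-- Ω: spatial/temporal domain -->
geometry: rectangle [0, 1] x [0, 1]

## Equations     <!-- E: governing equations -->
pde: heat, u_t = alpha * laplacian(u)
parameters: {alpha: 0.01}

## Boundary      <!-- B: boundary conditions -->
type: dirichlet
value: 0.0

## Initial       <!-- I: initial conditions -->
u0: gaussian(center=[0.5, 0.5], sigma=0.1)

## Observables   <!-- O: quantities of interest -->
output: u at t = 1.0

## Tolerance     <!-- ε: accuracy requirement -->
error: 1e-3 (relative, L2 norm)
```

Each of the six mandatory sections maps to one component of the S1 six-tuple $\mathcal{S} = (\Omega, \mathcal{E}, \mathcal{B}, \mathcal{I}, \mathcal{O}, \varepsilon)$, which is what makes the representation solver-independent.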

Figures (5)

  • Figure 1: Three-agent pipeline grounded in the simulability class $\mathfrak{S}$. The Plan Agent translates natural language into a spec.md file (realizing S1). The Judge Agent verifies S2--S4 via 5 pre-execution gates and audits the solution via post-execution quality checks (realizing S4). Rejection triggers redesign (up to 3 rounds) or a certificate that the problem lies outside $\mathfrak{S}$.
  • Figure 2: spec.md format and a concrete example. (a) Generic structure: 6 mandatory sections encode the S1 six-tuple $\mathcal{S} = (\Omega, \mathcal{E}, \mathcal{B}, \mathcal{I}, \mathcal{O}, \varepsilon)$; 2 optional sections link to the primitive basis (Proposition 3) and benchmark variations. The format uses YAML-compatible key-value syntax within Markdown section headers. (b) A real spec.md file for CT reconstruction from the benchmark archive. All 72 prospective tasks and 12 development problems are archived in this format. Formal grammar in Supplementary Section S8.
  • Figure 3: The Judge Agent's contribution grows with problem difficulty. Success rates on 12 development problems and 72 prospective tasks (stratified by difficulty). Without the Judge, the success rate drops from 89% to 53% overall, and from 79% to 29% on frontier tasks---those closest to $\partial \mathfrak{S}$. The gap widens monotonically with proximity to the boundary.
  • Figure 4: CT reconstruction comparison (modified Shepp--Logan phantom [shepp1974], 128 projections, Poisson noise, $n = 200$). (a) Ground truth. (b) Framework with Judge ($\in \mathfrak{S}$ verified): correct angular model, Ram-Lak filter, non-negativity enforced. (c) Without Judge: wrong angular range (360° assumed instead of 180°), producing doubled edges and streak artifacts; 5,326 non-negativity violations. (d) PSNR distribution: with Judge $25.6 \pm 0.3$ dB vs. without Judge $13.1 \pm 0.1$ dB ($n = 200$, $p < 10^{-30}$, Wilcoxon).
  • Figure 5: Failure analysis across 134 test cases. (a) $\mathfrak{S}$-verification reduces silent failures in two stages: pre-execution gates (42% $\to$ 6%) and post-execution quality audit (6% $\to$ 1.5%). (b) Quality ratio vs. estimated Lipschitz constant $\hat{L}_{\text{DAG}}$ (log scale). The two residual failures (red) occur at $\partial \mathfrak{S}$---bifurcation points where $\hat{L}_{\text{DAG}} \to \infty$---while all interior-$\mathfrak{S}$ problems (blue) achieve $\geq 75\%$ quality. Orange triangles: flagged by quality audit. The dashed line indicates the empirical boundary where certifiability (S4) degrades.
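The control flow described in Figure 1 can be sketched as a simple loop. This is a hedged reconstruction under stated assumptions: the callables `plan`, `solve`, `judge_gates`, and `quality_audit` are hypothetical placeholders standing in for the three agents, not the paper's actual API.

```python
from dataclasses import dataclass

MAX_ROUNDS = 3  # Figure 1: rejection triggers redesign, up to 3 rounds


@dataclass
class Verdict:
    passed: bool
    reason: str = ""


def run_pipeline(problem_text, plan, solve, judge_gates, quality_audit):
    """Plan -> Judge (pre-execution gates) -> Solve -> Judge (post-execution audit).

    Returns the accepted solution, or a certificate that the problem
    could not be verified to lie in the simulability class S.
    """
    feedback = ""
    for _ in range(MAX_ROUNDS):
        spec = plan(problem_text, feedback)       # natural language -> spec.md (S1)
        gate = judge_gates(spec)                  # verify S2-S4 before any execution
        if not gate.passed:
            feedback = gate.reason                # redesign with the Judge's feedback
            continue
        solution = solve(spec)
        audit = quality_audit(spec, solution)     # post-execution quality check (S4)
        if audit.passed:
            return {"status": "accepted", "solution": solution}
        feedback = audit.reason
    return {"status": "outside_S_certificate", "detail": feedback}
```

The key design point from Figure 1 is that rejection is informative: after three failed rounds the pipeline emits a certificate rather than silently returning unvalidated code.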
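The PSNR metric reported in Figure 4(d) can be computed directly from its standard definition; this is a minimal sketch, assuming unit-range images and the usual $10 \log_{10}(\text{data\_range}^2 / \text{MSE})$ formula, not code from the paper.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)  # mean squared error
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


# Example: a uniform offset of 0.1 on a unit-range image gives MSE = 0.01,
# hence 10 * log10(1 / 0.01) = 20 dB.
ref = np.zeros((128, 128))
est = ref + 0.1
print(round(psnr(ref, est), 2))  # → 20.0
```

A gap like the reported 25.6 dB vs. 13.1 dB corresponds to roughly an 18x difference in root-mean-square error, which is why the wrong angular range is visually obvious in Figure 4(c).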

Theorems & Definitions (6)

  • Definition 1: Simulability class $\mathfrak{S}$
  • Proposition 2: spec.md Completeness
  • Proposition 3: Primitive Realizability
  • Conjecture 4: Obstruction Completeness
  • Proposition S5: Automated Bounded-Error Realizability within $\mathfrak{S}$
  • Proposition S6: Obstructions at the Scientific Event Horizon