Evolution without an Oracle: Driving Effective Evolution with LLM Judges

Zhe Zhao, Yuheng Yang, Haibin Wen, Xiaojie Qiu, Zaixi Zhang, Qingfu Zhang

TL;DR

MADE demonstrates that evolution can operate without an objective ground truth by employing an LLM judge guided by Problem Specification, which decomposes a task into verifiable sub-requirements ($R$). The framework combines a Creator, a Judge, and a Requirement Decomposer: the Creator produces a population of candidates, and the Judge scores each candidate against $R$, returning a multi-dimensional fitness signal $v$ plus actionable feedback $s$. These signals drive a directed, stable search through the mutation operator $V_{\text{LLM}}$. Across DevAI, BigGen, InfoBench, and MatPlotBench, MADE achieves substantial gains over strong baselines (e.g., $61.92\%$ vs $39.89\%$ on DevAI and $0.95$ vs $0.72$ all-pass on InfoBench; BigGen average up to $4.90$). This work shifts automated optimization toward describable, human-valued qualities, opening avenues for open-ended domains that lack ground truth, while noting the computational cost inherent to large language models.
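The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `decompose`, `judge`, `create`, and `evolve` are hypothetical names, and each LLM call is replaced by a deterministic string-matching stand-in so the example runs offline. In a real system each of the three functions would prompt an LLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Evaluation:
    v: List[float]  # one score per sub-requirement (multi-dimensional fitness)
    s: str          # actionable feedback for the next mutation


def decompose(spec: str) -> List[str]:
    """Stand-in for the Requirement Decomposer: split the spec on ';'."""
    return [r.strip() for r in spec.split(";") if r.strip()]


def judge(candidate: str, requirements: List[str]) -> Evaluation:
    """Stand-in for the LLM Judge: a requirement passes if its text appears."""
    v = [1.0 if r in candidate else 0.0 for r in requirements]
    missing = [r for r, score in zip(requirements, v) if score == 0.0]
    # Feedback names the first unmet requirement (empty when all pass).
    return Evaluation(v=v, s=missing[0] if missing else "")


def create(parent: str, feedback: str) -> str:
    """Stand-in for the Creator's feedback-directed mutation."""
    return parent if not feedback else (parent + " " + feedback).strip()


def evolve(spec: str, population: List[str], generations: int = 5) -> Tuple[str, Evaluation]:
    requirements = decompose(spec)
    size = len(population)
    for _ in range(generations):
        # Mutate every candidate using the feedback the judge gave it.
        children = [create(c, judge(c, requirements).s) for c in population]
        # Merge parents and children, drop duplicates, keep the fittest.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=lambda c: sum(judge(c, requirements).v), reverse=True)
        population = pool[:size]
    best = population[0]
    return best, judge(best, requirements)


if __name__ == "__main__":
    best, evaluation = evolve("title; axis labels; legend", ["title", ""])
    print(best, evaluation.v)
```

Selection here ranks candidates by the sum of per-requirement scores, which is one simple way to collapse $v$ into a scalar; the paper may aggregate differently.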

Abstract

The integration of Large Language Models (LLMs) with Evolutionary Computation (EC) has unlocked new frontiers in scientific discovery but remains shackled by a fundamental constraint: the reliance on an Oracle, that is, an objective, machine-computable fitness function. This paper breaks this barrier by asking: Can evolution thrive in a purely subjective landscape governed solely by LLM judges? We introduce MADE (Multi-Agent Decomposed Evolution), a framework that tames the inherent noise of subjective evaluation through "Problem Specification." By decomposing vague instructions into specific, verifiable sub-requirements, MADE transforms high-variance LLM feedback into stable, precise selection pressure. The results are transformative: across complex benchmarks like DevAI and InfoBench, MADE outperforms strong baselines by over 50% in relative terms on software requirement satisfaction (from 39.9% to 61.9%) and achieves a 95% perfect pass rate on complex instruction following. This work validates a fundamental paradigm shift: moving from optimizing "computable metrics" to "describable qualities," thereby unlocking evolutionary optimization for the vast open-ended domains where no ground truth exists.

Paper Structure

This paper contains 27 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Qualitative demonstration of the optimization effect of our proposed multi-agent evolutionary framework on a PDE solution visualization task. The figure displays, from top to bottom: (a) a low-fitness solution from the initial population; (b) the high-fitness solution obtained after evolutionary iteration; (c) the ground-truth reference image for the task. The initially generated code (a) exhibits fundamental visualization defects: it renders the "True data," "Prediction," and "Error" fields merely as discrete scatter points on the mesh, which yields a discontinuous and noisy representation. After iterative optimization with MADE, the final solution (b) corrects the plotting logic, evolving from a scatter plot to a smooth, continuous interpolated heatmap. The optimized result accurately renders the field distribution and error magnitude, with a visual style and data representation highly consistent with the reference image (c). This example demonstrates that the framework can guide code evolution from a rudimentary discrete rendering to a high-quality, continuous scientific visualization.
  • Figure 2: The standard deviation range of the LLM Judge under input perturbation and high temperature.
  • Figure 3: The change in the LLM Judge's scoring standard deviation with respect to the mean score under input perturbation and high temperature.
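The quantity plotted in Figures 2 and 3 (the judge's score standard deviation, and how it relates to the mean score) could be measured along the following lines. This is a hedged sketch rather than the paper's protocol: `score_spread` is a hypothetical helper, `variants` stands for perturbed versions of one input, `repeats` stands for repeated high-temperature sampling, and the judge is any callable that returns a numeric score.

```python
import statistics
from typing import Callable, List, Tuple


def score_spread(
    judge: Callable[[str], float],
    variants: List[str],
    repeats: int = 1,
) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of judge scores.

    Each perturbed variant is scored `repeats` times, so the spread
    reflects both input perturbation and sampling noise in the judge.
    """
    scores = [judge(text) for text in variants for _ in range(repeats)]
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    # Toy judge for illustration only: scores a text by its length.
    toy_judge = lambda text: float(len(text))
    mean, std = score_spread(toy_judge, ["a", "bb", "ccc"], repeats=2)
    print(f"mean={mean:.3f} std={std:.3f}")
```

Running the same helper once with a holistic prompt and once per decomposed sub-requirement would give a direct comparison of judge stability under the two settings.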

Theorems & Definitions (2)

  • Definition 3.1: Structured Fitness Function $\Phi_{\text{LLM}}$
  • Definition 3.2: LLM-Guided Directional Mutation Operator $V_{\text{LLM}}$
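Only the names of these two definitions appear in this outline. One plausible reading, assembled solely from the TL;DR's notation ($R$, $v$, $s$, $V_{\text{LLM}}$), is sketched below. The candidate symbol $x$, the count $k$, and the per-requirement indexing are assumptions introduced for illustration, and the paper's formal statements may differ.

$$
\Phi_{\text{LLM}}(x, R) = (v, s), \qquad R = \{r_1, \dots, r_k\}, \quad v = (v_1, \dots, v_k)
$$

$$
x' = V_{\text{LLM}}(x, v, s)
$$

Under this reading, $v_i$ is the judge's score for candidate $x$ on sub-requirement $r_i$, $s$ is the accompanying textual feedback, and the mutation operator uses both to produce a revised candidate $x'$ rather than a random variation.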