Evolution without an Oracle: Driving Effective Evolution with LLM Judges

Zhe Zhao; Yuheng Yang; Haibin Wen; Xiaojie Qiu; Zaixi Zhang; Qingfu Zhang

TL;DR

MADE shows that evolutionary search can operate without an objective ground truth by using an LLM judge guided by a Problem Specification that is decomposed into verifiable sub-requirements ($R$). The framework combines a Creator, a Judge, and a Requirement Decomposer to maintain a population of candidates and to produce a multi-dimensional fitness signal $v$ plus actionable feedback $s$, enabling a directed, stable search via $V_{LLM}$. Across DevAI, BigGen, InfoBench, and MatPlotBench, MADE achieves substantial gains over strong baselines (e.g., $61.92\%$ vs. $39.89\%$ on DevAI, an all-pass rate of $0.95$ vs. $0.72$ on InfoBench, and a BigGen average of up to $4.90$). The work shifts automated optimization toward describable, human-valued qualities, opening evolutionary methods to open-ended domains that lack ground truth, while noting the computational cost inherent to large language models.

Abstract

The integration of Large Language Models (LLMs) with Evolutionary Computation (EC) has unlocked new frontiers in scientific discovery but remains shackled by a fundamental constraint: the reliance on an Oracle, that is, an objective, machine-computable fitness function. This paper breaks this barrier by asking: can evolution thrive in a purely subjective landscape governed solely by LLM judges? We introduce MADE (Multi-Agent Decomposed Evolution), a framework that tames the inherent noise of subjective evaluation through "Problem Specification." By decomposing vague instructions into specific, verifiable sub-requirements, MADE transforms high-variance LLM feedback into stable, precise selection pressure. Across complex benchmarks such as DevAI and InfoBench, MADE outperforms strong baselines by over 50% in relative terms on software requirement satisfaction (from 39.9% to 61.9%) and achieves a 95% perfect pass rate on complex instruction following. This work validates a fundamental paradigm shift from optimizing "computable metrics" to optimizing "describable qualities," thereby unlocking evolutionary optimization for the vast open-ended domains where no ground truth exists.
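
To make the loop described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy setting in which the three roles are replaced by deterministic stand-ins: `decompose` splits a specification on semicolons, `judge` marks a sub-requirement as satisfied when its text appears in the candidate, and `create` appends one unmet requirement taken from the feedback. In MADE each of these roles would be played by an LLM. All function names, the `Evaluation` type, and the selection rule (rank by the number of satisfied requirements) are hypothetical choices made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Evaluation:
    v: list[int]  # per-requirement pass/fail vector (multi-dimensional fitness)
    s: list[str]  # actionable feedback: the requirements that are still unmet


def decompose(spec: str) -> list[str]:
    """Stand-in Requirement Decomposer: split the spec on ';' (assumption)."""
    return [part.strip() for part in spec.split(";") if part.strip()]


def judge(candidate: str, requirements: list[str]) -> Evaluation:
    """Stand-in Judge: a requirement passes if its text occurs in the candidate."""
    v = [1 if req in candidate else 0 for req in requirements]
    s = [req for req, ok in zip(requirements, v) if not ok]
    return Evaluation(v, s)


def create(parent: str, feedback: list[str], rng: random.Random) -> str:
    """Stand-in Creator: revise the parent to address one unmet requirement."""
    if not feedback:
        return parent
    return (parent + " " + rng.choice(feedback)).strip()


def evolve(spec: str, pop_size: int = 4, generations: int = 5, seed: int = 0):
    rng = random.Random(seed)
    requirements = decompose(spec)
    population = [""] * pop_size  # start from empty candidates
    for _ in range(generations):
        # Score every candidate against every sub-requirement.
        scored = [(judge(c, requirements), c) for c in population]
        # Hypothetical selection rule: keep the half with the most passes.
        scored.sort(key=lambda pair: sum(pair[0].v), reverse=True)
        survivors = scored[: max(1, pop_size // 2)]
        # Feedback s directs the next revision instead of a blind mutation.
        children = [create(c, ev.s, rng) for ev, c in survivors]
        population = [c for _, c in survivors] + children
    best = max(population, key=lambda c: sum(judge(c, requirements).v))
    return best, judge(best, requirements)


if __name__ == "__main__":
    best, evaluation = evolve("has a title; has axis labels; uses a legend")
    print(best)
    print(evaluation.v, evaluation.s)
```

The point of the sketch is the shape of the signal: the judge returns a vector over sub-requirements plus the list of unmet ones, so selection and revision both act on specific requirements. Summing $v$ into a single rank is a simplification chosen here for brevity.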
