Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Sweta Karlekar; Carolina Zheng; Magnus Saebo; Nicolas Beltran-Velez; Shuyang Yu; John Bowlan; Michal Kucer; David Blei

TL;DR

Duel-Evolve is an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. The paper shows that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.

Abstract

Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where its accuracy is 20 percentage points higher than that of existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
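To make the aggregation step concrete, the following is a minimal sketch of fitting a Bayesian Bradley-Terry model to pairwise outcomes and reading off a mean and standard deviation per candidate. It is not the authors' implementation: the function name `fit_bradley_terry`, the zero-mean Gaussian prior, the plain gradient-ascent optimizer, and the diagonal form of the Laplace approximation are all assumptions chosen for brevity.

```python
import math

def fit_bradley_terry(n, duels, prior_var=1.0, steps=500, lr=0.1):
    """MAP utilities plus a diagonal Laplace approximation (illustrative only).

    duels: list of (i, j, c), with c = 1 if candidate i beat candidate j, else 0.
    Returns (mu, sigma): per-candidate posterior mean and standard deviation.
    """
    theta = [0.0] * n
    for _ in range(steps):
        # Gradient of the log posterior: Gaussian prior term first.
        grad = [-t / prior_var for t in theta]
        for i, j, c in duels:
            # Bradley-Terry win probability P(i beats j) = sigmoid(theta_i - theta_j).
            p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
            grad[i] += c - p
            grad[j] -= c - p
        theta = [t + lr * g for t, g in zip(theta, grad)]

    # Diagonal of the negative Hessian at the MAP point (assumed approximation;
    # a full Laplace approximation would invert the whole Hessian).
    precision = [1.0 / prior_var] * n
    for i, j, _ in duels:
        p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
        precision[i] += p * (1.0 - p)
        precision[j] += p * (1.0 - p)
    sigma = [1.0 / math.sqrt(q) for q in precision]
    return theta, sigma

# Hypothetical comparison history: candidate 0 beats 1 twice and 2 once; 1 beats 2.
# Candidate 3 has never been compared, so it keeps the prior mean and spread.
mu, sigma = fit_bradley_terry(4, [(0, 1, 1), (0, 1, 1), (0, 2, 1), (1, 2, 1)])
```

Candidates that have appeared in more duels end up with a smaller `sigma`, which is the uncertainty-aware behavior the abstract refers to: rarely compared candidates stay plausible optima until more comparisons are spent on them.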

Paper Structure (42 sections, 6 equations, 3 figures)

Introduction
Method
Problem Setting
A First Approach: Double Thompson Sampling with Dueling Bandits
Bradley-Terry Model.
Algorithm: Double Thompson Sampling.
From Double Thompson Sampling to Duel-Evolve
Operation (i): Approximating $p(\boldsymbol{\theta} \mid \mathcal{D}_{t})$
Operation (ii): Approximating $p^*$
Putting it together: the Duel-Evolve loop
Related Work
Optimization in discrete spaces.
Dueling bandits.
Experiments
Tasks and Metrics
...and 27 more sections

Figures (3)

Figure 1: Duel-Evolve approximates Double Thompson Sampling (DTS) over a combinatorial space. At round $t$, DTS maintains a posterior $p(\boldsymbol{\theta}\mid D_{1:t})$ over latent utilities $\boldsymbol{\theta}=(\theta_y)_{y\in\mathcal{Y}}$ given comparison history $D_{1:t}=\{(y_i,y_j,c_{ij})\}$, and selects the next duel by sampling $y_a,y_b\sim p_t^*(y)=P\!(y=\arg\max_{y'}\theta_{y'} \mid D_{1:t})$. Duel-Evolve approximates posterior inference by fitting a Bradley-Terry model on the evaluated pool $E_t$ with a Laplace approximation, yielding per-candidate summaries $(\mu_{i,t},\sigma_{i,t})$ (see Operation (i)); it then approximates maximizer-focused sampling by Thompson-sampling duels and parents $A_t$ from a pruned survivor set $S_t\subseteq E_t$, and proposing children via a conditioned LLM generator $y\sim p_\phi\!(y\mid x,\{(y_i,\mu_{i,t})\}_{y_i\in A_t})$ (see Operation (ii) and the Duel-Evolve loop).
Figure 2: MathBench accuracy over 150 generations. Left: Duel-Evolve performance stratified by difficulty level (Middle, High, College). Middle: Method comparison over generations: non-iterative baselines (Zero-shot CoT, Few-shot CoT, Self-consistency, and Best-of-$N$) remain flat, while iterative methods (Feedback Descent, GEPA, and Duel-Evolve) improve over time. Right: Final accuracy across methods. Duel-Evolve achieves the best performance.
Figure 3: LiveCodeBench accuracy over 200 generations. Left: Duel-Evolve performance stratified by difficulty level (Easy, Medium, Hard). Middle: Method comparison over generations: static baselines (Zero-shot CoT, Few-shot CoT, Self-consistency, and Best-of-$N$) are flat, while iterative methods (Feedback Descent, GEPA, and Duel-Evolve) improve over time. Right: Final accuracy across methods. Duel-Evolve achieves the best performance.
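
The duel-selection step described in the Figure 1 caption can be illustrated with a short sketch. It assumes independent Gaussian summaries $(\mu_i,\sigma_i)$ per candidate and a simple confidence-bound pruning rule; the helper names `prune_survivors` and `sample_duel`, the width parameter `k`, and the pruning rule itself are hypothetical and are not taken from the paper, which may prune and sample differently.

```python
import random

def prune_survivors(mu, sigma, k=2.0):
    """Keep candidates whose upper bound reaches the best lower bound (assumed rule)."""
    best_lower = max(m - k * s for m, s in zip(mu, sigma))
    return [i for i, (m, s) in enumerate(zip(mu, sigma)) if m + k * s >= best_lower]

def sample_duel(mu, sigma, survivors, rng):
    """Thompson-sample two candidates to compare from the survivor set."""
    def draw_argmax(candidates):
        # Draw one utility per candidate from its Gaussian summary; keep the best.
        draws = {i: rng.gauss(mu[i], sigma[i]) for i in candidates}
        return max(draws, key=draws.get)

    first = draw_argmax(survivors)
    # Resample with fresh draws, excluding the first pick, to get a challenger.
    rest = [i for i in survivors if i != first]
    second = draw_argmax(rest) if rest else first
    return first, second

# Hypothetical summaries: two close contenders and one clearly worse candidate.
mu = [2.0, 1.9, -5.0]
sigma = [0.5, 0.5, 0.1]
survivors = prune_survivors(mu, sigma)
duel = sample_duel(mu, sigma, survivors, random.Random(0))
```

Because each pick is the argmax of a fresh posterior draw, comparisons concentrate on candidates that could plausibly be the best, while clearly dominated candidates are pruned and receive no further comparison budget.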
