Pareto Optimal Code Generation

Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala

TL;DR

This work frames verifier selection as a Pareto optimization problem and empirically maps the accuracy-throughput frontier across verification signals (the full test suite, heuristics for selective execution, and outcome reward models, or ORMs) on four Python benchmarks. It finds that ORMs with staged verification shift the Pareto frontier outward.

Abstract

Generate-then-rank is the dominant test-time scaling (TTS) paradigm for code generation, but scaling accuracy by sampling and executing more candidates makes comprehensive verification a major computational bottleneck. This creates an inherent trade-off between accuracy and compute that, despite its importance to TTS, is often ignored. Specifically, faster but noisier signals, such as outcome reward models (ORMs), are dismissed as suboptimal. We frame verifier selection as a Pareto optimization problem and empirically map the accuracy-throughput frontier across signals, including the full test suite, heuristics for selective execution, and ORMs, across four Python benchmarks. We show that ORMs are most effective at optimizing the Pareto curve when pruning is used in the generate-then-rank pipeline--known as staged verification--where lightweight filters remove obviously incorrect solutions, including candidates with small syntactic or character-level bugs, before expensive verification. Our pruning analysis shows that eliminating incorrect yet highly ranked candidates (often character-level bugs) prevents wasted compute on incorrect tokens. We find that ORMs with staged verification shift the Pareto frontier outward, achieving 11.64x higher throughput at a cost of 8.26% accuracy relative to full test-suite verification.
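
The staged verification pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `staged_rank`, `compiles`, and the toy score table are hypothetical names introduced here, a syntax check stands in for the lightweight filter, and a lookup table stands in for a real ORM.

```python
from typing import Callable, Dict, List, Sequence


def staged_rank(
    candidates: Sequence[str],
    weak_filter: Callable[[str], bool],
    orm_score: Callable[[str], float],
) -> List[str]:
    """Drop candidates rejected by a cheap filter, then rank survivors by ORM score."""
    survivors = [c for c in candidates if weak_filter(c)]
    # Highest-scoring survivor first; only these would go on to expensive verification.
    return sorted(survivors, key=orm_score, reverse=True)


def compiles(source: str) -> bool:
    """Hypothetical lightweight filter: reject candidates that do not parse as Python."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",  # character-level bug: missing colon
    "def add(a, b):\n    return a - b\n",
]

# Made-up scores standing in for an ORM; the broken candidate is ranked highest on purpose.
toy_scores: Dict[str, float] = {
    candidates[0]: 0.7,
    candidates[1]: 0.9,
    candidates[2]: 0.4,
}

ranked = staged_rank(candidates, compiles, lambda c: toy_scores[c])
```

In this toy setup the candidate with the highest score contains a one-character bug, so an ORM-only ranking would put it first. The filter removes it before ranking, which is the kind of incorrect-yet-highly-ranked case the pruning analysis refers to.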

Paper Structure

This paper contains 39 sections, 3 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Staged Verification Shifts the Pareto Frontier. Each panel shows one generator size (500M to 14B) on CodeContests. Red dashed: non-staged verification. Blue solid: staged verification, which filters before ORM ranking. Gold $\times$: full test suite. The shaded region shows gains from staging; the frontier shifts outward across all generator sizes.
  • Figure 2: Comparison of verification strategies. Normal ranking runs all tests to produce a high-quality ordering but is slow, especially with a large test suite. Staged verification uses a weak verifier to filter obvious failures, then ranks survivors using an ORM.
  • Figure 3: Distribution of filtered candidates by ORM rank. Rank 1 is the top-ranked candidate; rank 128 is the lowest. Rows are filters; columns are datasets. Results shown for the 1.5B ORM; the 500M ORM graph is in \ref{fig:true-ranking-qc-inst-500m-t10-n128}.
  • Figure 4: False positive rate and cumulative time for random test subsets (top: 1.5B, bottom: 3B). Error bars show within-problem standard deviation across subset selections.
  • Figure 5: Cost-accuracy Pareto frontier for Qwen 2.5 Coder Instruct models on CodeContests. Top: 7B; bottom: 14B. Lines show Pareto curves for non-staged, 1-test, and 10-test. Costs are per 100k programs (log scale).
  • ...and 1 more figure
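
The Pareto curves in Figures 1 and 5 can be read as the set of non-dominated verifier configurations. The Python sketch below shows one common way to compute such a set when both throughput and accuracy are to be maximized. The function name `pareto_frontier`, the configuration names, and every number are illustrative placeholders, not results or code from the paper.

```python
from typing import List, Tuple

# (name, throughput, accuracy); higher is better on both axes.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both throughput and accuracy."""
    # Visit points from highest to lowest throughput (ties: higher accuracy first).
    ordered = sorted(points, key=lambda p: (-p[1], -p[2]))
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    for point in ordered:
        # A point survives only if it is more accurate than every faster point.
        if point[2] > best_accuracy:
            frontier.append(point)
            best_accuracy = point[2]
    return frontier


# Placeholder values chosen only to exercise the function.
configs: List[Point] = [
    ("config_a", 1.0, 0.60),
    ("config_b", 20.0, 0.45),
    ("config_c", 12.0, 0.55),
    ("config_d", 5.0, 0.50),
]

frontier = pareto_frontier(configs)
frontier_names = [name for name, _, _ in frontier]
```

Here `config_d` is dominated because `config_c` is both faster and more accurate, so it drops off the frontier. In this framing, shifting the frontier outward means adding configurations that dominate points on the previous frontier.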