Review Beats Planning: Dual-Model Interaction Patterns for Code Synthesis

Jan Miller

TL;DR

Cross-benchmark validation across 542 problems (HumanEval+ and MBPP+) reveals a moderating variable: review effectiveness scales with specification richness, yielding 4x more improvement on richly-specified problems than on lean ones, while remaining net-positive in both cases.

Abstract

How should two language models interact to produce better code than either can alone? The conventional approach, in which a reasoning model plans and a code specialist implements, seems natural but fails: on HumanEval+, plan-then-code degrades performance by 2.4 percentage points versus the code specialist alone. We show that reversing the interaction turns this loss into a gain. When the code specialist generates freely and the reasoning model reviews instead of planning, the same two models on the same hardware achieve 90.2% pass@1, exceeding GPT-4o (87.2%) and O1 Preview (89.0%), on ~$2/hr of commodity GPU. Cross-benchmark validation across 542 problems (HumanEval+ and MBPP+) reveals a moderating variable: review effectiveness scales with specification richness, yielding 4x more improvement on richly-specified problems (+9.8pp) than on lean ones (+2.3pp), while remaining net-positive in both cases. The practical implication is twofold: compose models by their cognitive strengths (reviewers review, coders code), and invest in specification quality to amplify the returns.
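The two interaction patterns contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Model` type, the prompt wording, and the `LGTM` early-exit convention are all assumptions introduced here, and any function that maps a prompt string to a completion string can stand in for either model.

```python
from typing import Callable

# Hypothetical interface: a model is any function from prompt text to completion text.
Model = Callable[[str], str]


def plan_then_code(task: str, reasoner: Model, coder: Model) -> str:
    """Pattern (a): the reasoning model plans first, then the coder implements the plan."""
    plan = reasoner(f"Write a step-by-step plan for this task:\n{task}")
    return coder(f"Task:\n{task}\n\nPlan:\n{plan}\n\nImplement the plan.")


def review_then_fix(task: str, reasoner: Model, coder: Model) -> str:
    """Pattern (b): the coder generates freely, the reasoner reviews, the coder fixes."""
    draft = coder(f"Task:\n{task}\n\nWrite a solution.")
    review = reasoner(
        f"Task:\n{task}\n\nCode:\n{draft}\n\nList any bugs, or reply LGTM."
    )
    # Assumed convention: the reviewer answers exactly "LGTM" when it finds nothing.
    if review.strip() == "LGTM":
        return draft
    return coder(
        f"Task:\n{task}\n\nCode:\n{draft}\n\nReview:\n{review}\n\nFix the code."
    )
```

The only structural difference is where the reasoning model sits: before the coder has written anything, or after there is a concrete draft to critique.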

Paper Structure (23 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Two dual-model interaction patterns. (a) Plan-then-code: the reasoning model plans upfront, but its abstract guidance can mislead the coder (75.6%). (b) Review-then-fix: the coder generates freely, the reasoning model reviews for bugs, and the coder fixes, achieving 87.8% without retry.
  • Figure 2: Pipeline configurations on HumanEval+. Plan-then-code (red) performs worse than the raw coder baseline. Review-then-fix (green) exceeds both GPT-4o and O1 Preview. Dashed line: coder baseline; dotted lines: proprietary model reference points.
  • Figure 3: Review effectiveness scales with specification richness. The 4x gap between rich specs (HumanEval+, +9.8pp) and lean specs (MBPP+, +2.3pp) persists even with spec enrichment. Green: HumanEval+ (rich); gray: MBPP+ (lean).
  • Figure 4: Mitigation attempts for the spec-richness gap. Left: HumanEval+ (rich specs). Right: MBPP+ (lean specs). Neither spec-gated review (purple) nor spec-enriched review (blue) outperforms unconditional review (green). Dashed line: coder-alone baseline.
  • Figure 5: Breakdown of 15 regressions in plan-then-code: tasks the coder solved alone but failed with planner guidance. Import contamination dominates: the planner suggests libraries the coder then uses without importing.
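
The import-contamination failure mode named in Figure 5 (a library is used but never imported) can be flagged statically. The sketch below is an illustrative heuristic built on Python's standard `ast` module; the function name `missing_imports` is hypothetical, and the source does not say how the paper's regressions were actually classified.

```python
import ast
import builtins


def missing_imports(source: str) -> set[str]:
    """Return names used as `name.attr` that are never imported or otherwise bound.

    Scope-insensitive heuristic: a name bound anywhere in the file counts as bound.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                bound.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                bound.add(alias.asname or alias.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            # Record the base of an attribute access, e.g. `np` in `np.mean`.
            used.add(node.value.id)
    return used - bound


# Usage: a solution that calls np.mean without importing numpy is flagged.
contaminated = "def f(xs):\n    return np.mean(xs)\n"
print(missing_imports(contaminated))
```

A check of this kind only looks at attribute-style usage such as `np.mean`, so it would miss a bare call to a function that was never imported.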