Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

Jingjie Ning, Xueqi Li, Chengyu Yu

Abstract

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive multiple-choice question answering (MCQ) and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Our findings demonstrate that the utility of multi-LLM revision is bottlenecked by task structure and draft quality, calling for targeted pipeline designs rather than blanket revision strategies.
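A minimal sketch of how four matched conditions can yield three additive components: writing $x_i$ for the accuracy under condition $i$, and assuming an illustrative labeling (only the content term $x_2-x_4$ is identified in the figure captions below) in which $x_1$ is the weak drafter alone, $x_2$ the strong reviewer given the real draft, $x_3$ the strong model solving from scratch, and $x_4$ the strong reviewer given a semantically null scaffold, the second-pass gain telescopes as

$$ \underbrace{x_2 - x_1}_{\text{second-pass gain}} \;=\; \underbrace{(x_3 - x_1)}_{\text{re-solving}} \;+\; \underbrace{(x_4 - x_3)}_{\text{scaffold}} \;+\; \underbrace{(x_2 - x_4)}_{\text{content}}. $$

Each term is a paired difference between two matched conditions, so under this labeling the three components sum exactly to the overall gain.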

Paper Structure

This paper contains 41 sections, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Illustration of the decomposition.
  • Figure 2: Signed decomposition of second-pass gains in the (weak $\rightarrow$ strong) setting into re-solving, scaffold, and content. MCQ gains are mostly re-solving, whereas LiveCodeBench is scaffold-dominated with negative content. Error bars show paired 95% CI.
  • Figure 3: Benefit–harm view of the real-draft content effect ($x_2-x_4$). Points below the diagonal indicate net benefit and points above it net harm. MCQ lies near cancellation, whereas LiveCodeBench shifts from net harm to net benefit across settings.
  • Figure 4: LiveCodeBench difficulty split for Pair 1. The content effect becomes increasingly negative from easy to hard.
  • Figure 5: Mechanism-level decomposition of second-pass outcomes in the primary setting. On MCQ, diagnostic cases are dominated by the re-solving family; on LiveCodeBench, they are dominated by scaffold-positive cases.
  • ...and 3 more figures