
DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang

TL;DR

This work addresses the data bottleneck that limits further progress in LLM reasoning by proposing Debate–Train–Evolve (DTE), a ground-truth-free framework that learns from multi-agent debate (MAD) traces to autonomously evolve a single model. It introduces Reflect-Critique-Refine (RCR) prompting to improve debate quality and uses Group Relative Policy Optimization (GRPO) to distill debate insights into a single policy without a value function, so that inference after evolution needs only one model. Empirically, DTE achieves an average accuracy gain of 8.92% on GSM-PLUS and generalizes across domains to ARC and CommonsenseQA, suggesting that it captures general reasoning capabilities rather than dataset-specific patterns. The approach combines the benefits of MAD with single-model efficiency, though the authors note challenges such as catastrophic forgetting in smaller models and higher training costs, which leaves room for further optimization and broader task coverage.
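To make the "without a value function" point concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each sampled response is scored against the mean and standard deviation of its own sampling group rather than against a learned critic. The function name, the `eps` constant, the use of the population standard deviation, and the 0/1 reward (1 when a sample matches the debate consensus answer) are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), with the
    population standard deviation. No value network is involved; the group
    itself serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one question:
# 1.0 if the response's final answer matches the debate consensus, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that agree with the consensus receive a positive advantage and the others a negative one. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.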

Abstract

Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on seven reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities. Our framework code and trained models are publicly available at https://github.com/ctrl-gaurav/Debate-Train-Evolve

Paper Structure

This paper contains 54 sections, 6 equations, 6 figures, 35 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of the proposed Debate–Train–Evolve framework. Left-Debate: Several agents debate until they converge on a consensus (green ✓) or expose a wrong path (red ✗). Centre-Train: we remove pure debate elements, keep the high-quality reasoning traces and consensus answer, and use them to fine-tune a single policy with GRPO. Right-Evolve: the evolved agent replaces its earlier self, so future inference requires just one forward pass yet outperforms the committee on maths, science, and commonsense benchmarks.
  • Figure 2: Accuracy vs. evolution round.
  • Figure 3: Results (%) on: GSM8K, GSM-PLUS, and ARC-Challenge datasets. Performance is compared across three evaluation settings: single model inference, the Original Multi-Agent Debate (MAD@3) prompt, and our proposed RCR (RCR-MAD (Ours)@3) prompting.
  • Figure 4: Scaling up agents. Accuracy of four Qwen model sizes as the number of agents grows from 1 to 7.
  • Figure 5: Diminishing returns in GRPO updates after 8K steps. GSM-PLUS accuracy for five models as a function of the number of training steps during GRPO.
  • ...and 1 more figures
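
The Debate and Train stages described in the Figure 1 caption can be sketched as a small filtering step: keep a debate only if the agents reach consensus, then retain the reasoning traces that end in the agreed answer. This is a hedged illustration rather than the authors' implementation. The function names, the `(trace, answer)` tuple layout, and the `min_agreement` threshold (unanimity by default) are assumptions introduced for this example.

```python
from collections import Counter


def consensus_answer(final_answers, min_agreement=1.0):
    """Return the most common final answer if enough agents agree, else None.

    Assumption: consensus means a fraction >= min_agreement of agents give
    the same final answer; the default of 1.0 requires unanimity.
    """
    if not final_answers:
        return None
    answer, count = Counter(final_answers).most_common(1)[0]
    if count / len(final_answers) >= min_agreement:
        return answer
    return None


def build_training_examples(debates, min_agreement=1.0):
    """Turn finished debates into ground-truth-free training examples.

    `debates` is a list of (question, agent_outputs) pairs, where
    agent_outputs is a list of (reasoning_trace, final_answer) tuples.
    """
    examples = []
    for question, agent_outputs in debates:
        answers = [answer for _, answer in agent_outputs]
        agreed = consensus_answer(answers, min_agreement)
        if agreed is None:
            continue  # no consensus: skip rather than train on a guess
        traces = [trace for trace, answer in agent_outputs if answer == agreed]
        examples.append({"question": question, "traces": traces, "answer": agreed})
    return examples


# Hypothetical debate outcomes for two questions with three agents each.
debates = [
    ("What is 6 * 7?", [("6*7=42", "42"), ("7*6=42", "42"), ("42", "42")]),
    ("What is 9 + 8?", [("9+8=17", "17"), ("9+8=18", "18"), ("17", "17")]),
]
examples = build_training_examples(debates)
```

With the default unanimity threshold only the first question survives. Lowering `min_agreement` to a majority would also keep the second, at the cost of noisier pseudo-labels.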