Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment

Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani

TL;DR

This paper formalizes self-consistency as an intrinsic property of well-aligned reasoning in language models and introduces MACA, a post-training RL framework in which multiple clones of an LM debate and ground their reasoning in peer arguments. By training on debate-derived consensus signals (majority/minority trajectories), MACA provides richer supervision than single-round majority voting, leading to large gains in self-consistency (+27.6% on GSM8K) and problem-solving performance across several benchmarks, with notable generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA). The approach uses four objectives, three preference-based or RL (MV-DPO, MV-KTO, MV-GRPO) and one imitation-learning (MV-SFT), to align internal reasoning with consensus, improving single-agent inference, multi-agent inference, and ensemble decision-making (up to +42.7% on MathQA). These results indicate that consensus-based post-training can unlock latent reasoning capabilities, enabling more reliable, concise, and robust reasoning without external supervision, while highlighting avenues for future work in heterogeneity, confidence weighting, and broader task coverage.

Abstract

Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. While inference-time methods can mitigate these inconsistencies, they fail to address the core problem: LMs struggle to reliably select reasoning pathways leading to consistent outcomes under exploratory sampling. To address this, we formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. These trajectories emerge from deliberative exchanges where agents ground reasoning in peer arguments, not just aggregation of independent attempts, creating richer consensus signals than single-round majority voting. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA). These findings, coupled with strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA), demonstrate robust self-alignment that more reliably unlocks latent reasoning potential of language models.
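The majority/minority split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(reasoning, answer)` tuple layout, the `prompt`/`chosen`/`rejected` record keys, and the choice to discard ties are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def split_by_consensus(responses):
    """Split final-round debate responses by agreement with the majority answer.

    `responses` is a list of (reasoning_text, final_answer) tuples, one per agent.
    Returns (majority_answer, majority_traces, minority_traces).
    Assumption: a tie for the top answer yields no consensus signal.
    """
    counts = Counter(answer for _, answer in responses).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, [], []  # tie: skip this prompt (hypothetical policy)
    majority_answer = counts[0][0]
    majority = [text for text, answer in responses if answer == majority_answer]
    minority = [text for text, answer in responses if answer != majority_answer]
    return majority_answer, majority, minority


def preference_pairs(prompt, majority, minority):
    """Pair every majority trace (chosen) with every minority trace (rejected)."""
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": rejected}
        for chosen, rejected in product(majority, minority)
    ]


# Toy debate outcome: three agents, two of which agree on "18".
demo = [("6 * 3 = 18", "18"), ("3 * 6 = 18", "18"), ("6 + 3 = 20", "20")]
demo_answer, demo_majority, demo_minority = split_by_consensus(demo)
demo_pairs = preference_pairs("What is 6 * 3?", demo_majority, demo_minority)
```

Under these assumptions, the majority traces alone would serve as imitation targets (as in MV-SFT), while the chosen/rejected pairs would feed an objective that uses both positive and negative examples (as in MV-DPO or MV-KTO).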

Paper Structure

This paper contains 58 sections, 6 equations, 12 figures, 22 tables, 1 algorithm.

Figures (12)

  • Figure 1: Multi-Agent Consensus Alignment framework: Multiple clones of a base LM engage in multi-agent debate to generate majority and minority reasoning trajectories. The framework splits responses based on alignment with majority consensus to create preference pairs. MV-GRPO compares online samples against majority signals, while MV-SFT imitates majority traces directly. In contrast, MV-DPO and MV-KTO utilize both positive (majority) and negative (minority) examples to learn relative separation between these preference pairs. Updated agents can then be used for single-agent or multi-agent inference, or continue iterative training.
  • Figure 2: Consistency before and after MACA post-training. Pre-trained models (Orange) show low consistency across sampled trajectories. Post-training with MACA (Blue) substantially improves sampling consistency. Averaged over 500 test prompts with 20 trajectories each.
  • Figure 3: Post-training self-consistency improves sampling accuracy. Dashed: Pass@t (oracle upper bound), solid: MV@t (majority over $t$ samples), dotted: greedy ($\tau=0$) accuracy. (Blue): post-trained model. (Orange): base model. Curves computed over 500 prompts.
  • Figure 4: Debate-aware RL improves all stages of multi-agent debate. Incorporating debate context in RL teaches agents to leverage prior arguments, improving final-round consensus. Stages: initial round average, initial round majority vote, final round average, final round majority vote.
  • Figure 5: Self-consistency improvements persist without token constraints. Models trained with 256-token debates still show gains when tested with full-length responses, though with reduced effect sizes because the training signal from length-constrained debates is weaker relative to the unconstrained testing conditions. Colors: Blue: post-trained model, Orange: base model.
  • ...and 7 more figures
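
The quantities named in the captions of Figures 2 and 3 can be sketched in Python as follows. This is an illustrative sketch, not the paper's evaluation code: `pass_at_t` uses the standard unbiased pass@k estimator, `mv_at_t` takes the majority over the first `t` samples, and `agreement` is one plausible proxy for sampling consistency (the share of samples matching the most common answer). The function names and the exact definitions are assumptions; the paper may compute these differently.

```python
from collections import Counter
from math import comb


def pass_at_t(n, c, t):
    """Unbiased estimate of Pass@t given n samples, c of which are correct.

    Probability that at least one of t samples drawn without replacement
    from the n available samples is correct.
    """
    if n - c < t:
        return 1.0  # every size-t subset contains a correct sample
    return 1.0 - comb(n - c, t) / comb(n, t)


def mv_at_t(answers, gold, t):
    """Return 1 if the majority answer among the first t samples equals gold, else 0."""
    majority_answer = Counter(answers[:t]).most_common(1)[0][0]
    return int(majority_answer == gold)


def agreement(answers):
    """Fraction of sampled answers that match the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


# Toy example: four sampled answers for one prompt whose gold answer is "a".
samples = ["a", "b", "a", "c"]
n_correct = sum(answer == "a" for answer in samples)
```

Averaging each quantity over all test prompts would give one point on the corresponding curve; greedy accuracy is simply the accuracy of the single deterministic ($\tau=0$) response.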