Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang

TL;DR

We propose an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Training with it significantly improves post-debate accuracy and stability and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving Multi-Agent Debate (MAD).

Abstract

Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying reasoning ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvements accompanied by a substantial increase in token entropy, and a marked divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty, reducible by debate context, and aleatoric uncertainty, induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.
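
To make the decomposition concrete, here is a minimal sketch of the standard entropy-based split of predictive uncertainty (total = aleatoric + epistemic). It is an illustration under stated assumptions, not the paper's estimator: the function names, the use of a small set of sampled answer distributions, and the toy inputs are all hypothetical, and the paper's own definitions condition on the debate context in ways this abstract does not spell out.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) along `axis`; probabilities are clipped for stability."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)


def decompose_uncertainty(probs):
    """Split predictive uncertainty using K sampled answer distributions.

    probs: array-like of shape (K, C); each row is one sampled distribution
           over C candidate answers (an assumption made for illustration).
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged distribution,
      aleatoric = average entropy of the individual distributions,
      epistemic = total - aleatoric (non-negative by Jensen's inequality).
    """
    probs = np.asarray(probs, dtype=float)
    mean_pred = probs.mean(axis=0)
    total = entropy(mean_pred)
    aleatoric = entropy(probs, axis=-1).mean()
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


# Hypothetical case 1: samples are individually confident but disagree,
# so most of the uncertainty is epistemic.
print(decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]]))

# Hypothetical case 2: every sample is equally unsure,
# so the uncertainty is entirely aleatoric.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```

In this toy reading, "high epistemic gain under controlled aleatoric cost" would correspond to debate shrinking the disagreement term without inflating the per-sample entropy term.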

Paper Structure

This paper contains 43 sections, 3 theorems, 57 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Lemma 3.1

Given an input question $x$, assuming that the initial context is empty and that the debate context $c_t$ is valid, i.e., $p(c_t\mid x)>0$, the log-odds of the binary hypothesis $h\in\{0,1\}$ after debate can be written as

Figures (6)

  • Figure 1: Multi-agent debate performance with two Qwen2.5-3B-Instruct models. As accuracy (red line) increases, token entropy (blue dashed area), which reflects uncertainty, also increases significantly.
  • Figure 2: Overview of uncertainty decomposition in MAD. The purple dashed area represents the system epistemic uncertainty and the gray line represents the total uncertainty. The blue and orange lines represent the accuracy on the test datasets. Top row: homogeneous multi-agent debate. Bottom row: heterogeneous multi-agent debate.
  • Figure 3: Uncertainty dynamics (gray and purple) vs. accuracy (orange and blue). UMAD training suppresses aleatoric noise and improves both agents' accuracies across debate turns.
  • Figure 4: Trajectory-level pairwise debate rollouts strategy.
  • Figure 5: Flipping ratio of first-turn responses in homogeneous and heterogeneous debate.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Lemma 3.1: Noisy Debate Accuracy Update
  • Proposition 3.2: System-level Uncertainty Decomposition
  • Theorem 3.3: Heterogeneous evidence yields larger epistemic gain