Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning

Edward Y. Chang, Ethan Y. Chang

TL;DR

MACI presents a dual-dial controller for multi-agent LLM reasoning that separately gates evidence quality and controls interaction intensity, with a moderator that stops debates when progress plateaus. The method comes with theory-lite guarantees (nonincreasing dispersion and plateau termination in $O(1/\varepsilon)$ rounds, and $\tilde{O}(\sqrt{KT})$ no-regret for a budgeted scheduler) and demonstrates improved accuracy and calibration while reducing tokens in clinical diagnosis and news-bias tasks. A cross-family evaluator (CRIT) provides robust, non-oracle judgment that supports soft weighting and stopping, with stability validated under judge swaps when high-capability models are used. MACI also translates residual uncertainty into precision RAG plans to guide what to retrieve next, showing portability across domains without domain-specific tuning. Overall, MACI reframes multi-agent debate as a budget-aware, measurable, and provably terminating controller that balances exploration and consolidation through principled information-theoretic signals.

Abstract

Multi-agent debate often wastes compute by using a fixed adversarial stance, aggregating without deliberation, or stopping on heuristics. We introduce MACI, an active controller with two independent dials that decouple information from behavior: an information dial that gates evidence by quality, and a behavior dial that schedules contentiousness from exploration to consolidation. A moderator tracks disagreement, overlap, evidence quality, and argument quality, and halts when gains plateau. We provide theory-lite guarantees for nonincreasing dispersion and provable termination, with a budget-feasible scheduler. Across clinical diagnosis and news-bias tasks, MACI improves accuracy and calibration while reducing tokens, and converts residual uncertainty into precision RAG plans that specify what to retrieve next. We use a cross-family LLM judge (CRIT) as a conservative soft weight and stop signal, validated for order invariance and judge-swap stability; stability depends on using high-capability judges. MACI turns debate into a budget-aware, measurable, and provably terminating controller.
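As a rough illustration of the "halt when gains plateau" idea, the following minimal sketch tracks a single signal (generalized Jensen-Shannon dispersion across agents) and stops when its per-round decrease falls below a threshold. It is not MACI's moderator, which also tracks overlap, evidence quality, and argument quality. The names `js_dispersion` and `run_debate`, the mixing weight `mix`, and the threshold `eps` are hypothetical, and mixing each agent toward the group mean is a stand-in for an actual debate round.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def js_dispersion(dists):
    """Generalized Jensen-Shannon divergence: H(mean) - mean of H(agent)."""
    n, k = len(dists), len(dists[0])
    mean = [sum(d[i] for d in dists) / n for i in range(k)]
    return entropy(mean) - sum(entropy(d) for d in dists) / n

def run_debate(dists, eps=1e-3, max_rounds=50, mix=0.3):
    """Toy loop (assumption): each round mixes every agent toward the
    group mean; the moderator stops once dispersion gains drop below eps."""
    history = [js_dispersion(dists)]
    for _ in range(max_rounds):
        n, k = len(dists), len(dists[0])
        mean = [sum(d[i] for d in dists) / n for i in range(k)]
        dists = [[(1 - mix) * d[i] + mix * mean[i] for i in range(k)]
                 for d in dists]
        history.append(js_dispersion(dists))
        if history[-2] - history[-1] < eps:  # plateau test
            break
    return dists, history

agents = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
final, hist = run_debate(agents)
print(len(hist) - 1, "rounds; dispersion", hist[0], "->", hist[-1])
```

In this toy update the group mean is unchanged by mixing and entropy is concave, so dispersion cannot increase from one round to the next. That mirrors the flavor of the nonincreasing-dispersion guarantee, but only for this simplified update rule.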

Paper Structure

This paper contains 187 sections, 4 theorems, 40 equations, 5 figures, 21 tables, 2 algorithms.

Key Result

Proposition 1

Let $T$ be the number of rounds executed before the plateau test fires or the budget is exhausted. Under the assumptions above, the expected regret of Algorithm \ref{alg:bf-ucb} with respect to the best fixed action $a^\star\in\mathcal{A}$ that satisfies the budget is $\tilde{O}(\sqrt{KT})$, and the expected budget violation is zero by construction of $S_t$.
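The algorithm itself is not reproduced here, so the following is only a sketch of one way "zero budget violation by construction" can arise: a standard UCB1 rule restricted at every round to the actions that still fit in the remaining budget. Reading $S_t$ as this affordable set is an assumption, and the function name, cost vector, and reward function are all hypothetical.

```python
import math

def budget_feasible_ucb(costs, reward_fn, budget, max_rounds):
    """UCB1 over the affordable actions only (a sketch, not the paper's
    algorithm). Returns the chosen actions and the total cost spent."""
    k = len(costs)
    counts = [0] * k      # pulls per action
    means = [0.0] * k     # running mean reward per action
    spent = 0.0
    picks = []
    for t in range(1, max_rounds + 1):
        # Feasible set: actions whose cost fits in the remaining budget.
        feasible = [a for a in range(k) if spent + costs[a] <= budget]
        if not feasible:
            break  # nothing affordable, so the budget can never be exceeded
        untried = [a for a in feasible if counts[a] == 0]
        if untried:
            a = untried[0]
        else:
            a = max(feasible, key=lambda i: means[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fn(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        spent += costs[a]
        picks.append(a)
    return picks, spent

# Hypothetical example: three schedules with fixed costs and fixed rewards.
rewards = [0.2, 0.8, 0.5]
picks, spent = budget_feasible_ucb([1.0, 2.0, 1.5], lambda a: rewards[a],
                                   budget=100.0, max_rounds=1000)
print(len(picks), "rounds, spent", spent)
```

Because infeasible actions are filtered out before selection, the spend never exceeds the budget regardless of how the rewards behave. The regret analysis is a separate matter and is not established by this sketch.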

Figures (5)

  • Figure 1: Internal signals over debate rounds (clinical). Left: entropy declines under scheduled $\mathrm{CL}$. Middle: evidence quality $\mathit{Q}$ rises as the evidence gate $\tau_{\mathit{Q}}$ tightens. Right: argument quality $\mathrm{CRIT}$ rises as low-quality arguments are filtered. Termination coincides with plateaued IG and low dispersion (not shown).
  • Figure 2: Convergence in two cases: $D_{\mathrm{JS}}$ and allied distances decrease monotonically under scheduling.
  • Figure 3: Convergence signals during bias mitigation (news bias). Wasserstein distance falls, normalized MI rises and then plateaus, and cross-entropy declines. Debates stop when dispersion and information gains plateau, mirroring the clinical setting (Appx. \ref{app:news_bias}).
  • Figure 4: Annotator rating distributions. Left: Democrat scandals. Right: Republican scandals. Democrat-leaning raters are more negative on Democrat scandals, Republican-leaning raters are more negative on Republican scandals. The typical gap is about one class step.
  • Figure 5: Convergence during bias debates. Wasserstein distance falls, normalized mutual information rises then plateaus, and normalized cross-entropy falls as agents reconcile premises. The same pattern that drives consolidation in diagnosis appears here.

Theorems & Definitions (8)

  • Proposition 1: No-regret versus best fixed schedule
  • proof
  • Lemma 1: Monotonicity under gated averaging
  • proof
  • Proposition 2: Termination in $O(1/\varepsilon)$ expected rounds
  • proof
  • Corollary 1: Geometric contraction yields $O(\log(1/\varepsilon))$
  • proof
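
The proofs are not included in this listing. As a hedged sketch of why rates of this shape are plausible, consider a simplified deterministic setting that is an assumption here rather than the paper's argument: $D_t \ge 0$ is the dispersion after round $t$, $D_t$ is nonincreasing (as in Lemma 1), and the plateau test fires at the first round with $D_{t-1} - D_t < \varepsilon$. The contraction factor $\rho$ is likewise a hypothetical symbol.

```latex
% Case 1: the test has not fired through round T, so every round gained at least \varepsilon.
\begin{gather*}
D_0 \;\ge\; D_0 - D_T \;=\; \sum_{t=1}^{T} \left(D_{t-1} - D_t\right) \;\ge\; T\varepsilon
\quad\Longrightarrow\quad T \;\le\; \frac{D_0}{\varepsilon}.
\end{gather*}
% Case 2: additionally assume geometric contraction D_t \le \rho^{t} D_0 with 0 < \rho < 1.
\begin{gather*}
D_{t-1} - D_t \;\le\; D_{t-1} \;\le\; \rho^{\,t-1} D_0 \;<\; \varepsilon
\quad\text{once}\quad
t \;>\; 1 + \frac{\log(D_0/\varepsilon)}{\log(1/\rho)}.
\end{gather*}
```

The first bound gives $O(1/\varepsilon)$ rounds and the second gives $O(\log(1/\varepsilon))$, matching the orders stated in Proposition 2 and Corollary 1. The paper's statements concern expected rounds, so its actual argument presumably handles randomness that this sketch ignores.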