
On the Uncertainty of Large Language Model-Based Multi-Agent Systems

Yuxuan Zhao, Sijia Chen, Ningxin Su

TL;DR

This paper investigates why multi-agent systems (MAS) built on open-source LLMs succeed or fail by systematically analyzing uncertainty dynamics at the token, trajectory, and round levels across six benchmarks and four coordination topologies. The authors find that single-agent systems (SAS) outperform MAS in approximately 43.3% of cases, and that the critical determinants of MAS performance arise in the first interaction round, with peak entropy generally detrimental. They distill three principles (Certainty Preference, Base Uncertainty, and Task Awareness) and introduce the Entropy Judger, a lightweight classifier that predicts per-sample correctness from entropy traces and enables pass@$k$ selection without ground-truth labels. The work presents uncertainty as a principled lens for diagnosing MAS failures and guiding architectural choices, offering both theoretical insights and practical tooling for robust multi-agent reasoning with LLMs. It further shows that RL-tuned bases can invert typical entropy effects, enabling MAS to outperform SAS under certain conditions and tasks.

Abstract

Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of uncertainty, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies and six benchmark tasks. By analyzing 245 features spanning token-, trajectory-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that uncertainty dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: reducing uncertainty at any stage for any agent is critical for guaranteeing correct solutions; 2) Base Uncertainty: base models with lower entropy during problem-solving directly benefit MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
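To make the token- and trajectory-level quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes token-level entropy is the Shannon entropy of each next-token distribution and that a trajectory is summarized by its mean token entropy. The function names (`token_entropy`, `trajectory_entropy`, `select_most_certain`) are hypothetical. The selection rule is a simple lowest-entropy heuristic in the spirit of Certainty Preference; it stands in for, and is not, the learned Entropy Judger.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Mean token entropy over one generated trajectory (assumed summary)."""
    if not step_probs:
        raise ValueError("trajectory must contain at least one token")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def select_most_certain(candidates: Sequence[Sequence[Sequence[float]]]) -> int:
    """Return the index of the candidate with the lowest mean entropy.

    Stand-in heuristic only: the paper's Entropy Judger is a classifier
    over many entropy features, which this sketch does not reproduce.
    """
    scores = [trajectory_entropy(c) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    uncertain = [[0.5, 0.5], [0.4, 0.6]]   # two tokens, near-uniform
    confident = [[0.9, 0.1], [0.99, 0.01]]  # two tokens, peaked
    print(select_most_certain([uncertain, confident]))
```

Selecting by a single mean-entropy score needs no ground-truth labels, which is the property the abstract highlights for pass@k selection; a trained classifier could replace the scoring function without changing the selection step.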

Paper Structure

This paper contains 118 sections, 1 equation, 21 figures, and 4 tables.

Figures (21)

  • Figure 1: Accuracy comparison of SAS and MAS across models and datasets. For brevity, LLaMA-3.2-3B-Instruct and LLaMA-3.1-8B-Instruct are denoted as L-3 and L-8, respectively; Qwen3-0.6B, Qwen3-4B, and Qwen3-8B are denoted as Q-0.6, Q-4, and Q-8. The base denotes the accuracy of a single $M_{\text{base}}$ on each dataset.
  • Figure 2: Base model entropy limits MAS effectiveness. The left two subfigures show results for LLaMA; the right two for Qwen. (a) Relationship between feature values and SHAP values for most important entropy features on $\mathcal{G}_{\text{base-H}}$, sorted by $\bar{I}_j$ and annotated with $\rho_j$. (b) MAS performance across deciles of $M_{\text{base}}$ entropy: $M_{\text{base}}$ entropy is partitioned into ten equal-sized bins, and average MAS accuracy (aggregated over datasets and model sizes) is computed per bin. Additionally, the average $M_{\text{base}}$ entropy and accuracy across all datasets are overlaid as markers.
  • Figure 3: MAS mainly fails on inter-agent misalignment. The left two subfigures show results for LLaMA; the right two for Qwen. (a) Same as Figure 2(a), but for entropy features in $\mathcal{G}_{\text{MAS}}$. (b) Impact of these features on sample predicted correctness: for each sample in the LightGBM and XGBoost test sets, we plot feature values against the average predicted probability of correctness from both models.
  • Figure 4: Uncertainty in MAS exerts distinct effects depending on task difficulty and the coordination architecture. (a, c) Feature-SHAP relationships for top entropy features in $\mathcal{G}_{\text{MAS}}$, grouped by dataset (a) and architecture (c). (b, d) Corresponding box plots across all models, annotated with average MAS correctness per dataset (b) or per architecture (d).
  • Figure 5: More rounds do not necessarily improve MAS performance. (a) Accuracy and token consumption for different MAS architectures with $R=2$ and $R=5$ on two benchmarks. (b) Evolution of three key entropy metrics across rounds. (c) The impact of two prominent entropy features, notable for their high importance ($\bar{I}$) and strong correlation ($|\rho|$) with sample correctness.
  • ...and 16 more figures
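The decile analysis described in the caption of Figure 2(b) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes "equal-sized bins" means bins with equal sample counts (quantile bins), and the function name `accuracy_by_entropy_decile` and its inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def accuracy_by_entropy_decile(
    entropies: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> List[Tuple[float, float]]:
    """Sort samples by base-model entropy, split them into n_bins groups of
    (nearly) equal size, and return (mean entropy, accuracy) per group."""
    if len(entropies) != len(correct):
        raise ValueError("entropies and correct must have the same length")
    n = len(entropies)
    if n < n_bins:
        raise ValueError("need at least one sample per bin")

    # Sample indices ordered from lowest to highest entropy.
    order = sorted(range(n), key=lambda i: entropies[i])

    result = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes within one sample of each other.
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        idx = order[lo:hi]
        mean_entropy = sum(entropies[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        result.append((mean_entropy, accuracy))
    return result


if __name__ == "__main__":
    # Synthetic data: low-entropy samples are correct, high-entropy ones are not.
    ents = [i / 20 for i in range(20)]
    hits = [i < 10 for i in range(20)]
    for mean_h, acc in accuracy_by_entropy_decile(ents, hits):
        print(f"{mean_h:.3f}\t{acc:.2f}")
```

If the bins were instead meant to have equal width over the entropy range, the slicing step would be replaced by thresholds on the entropy values, and some bins could then be empty.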