On the Reliability Limits of LLM-Based Multi-Agent Planning

Ruicheng Ao, Siyang Gao, David Simchi-Levi

Abstract

This technical note studies the reliability limits of LLM-based multi-agent planning as a delegated decision problem. We model the LLM-based multi-agent architecture as a finite acyclic decision network in which multiple stages process shared model-context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common-evidence regime, this implies that optimizing over multi-agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget-constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.

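The abstract's logarithmic-loss identity — the value lost by compressing the shared signal equals a conditional mutual information — can be checked on a toy discrete example. The following sketch (illustrative distributions, not the paper's setup) compresses a 4-valued signal $S$ into a binary message $M=f(S)$ and verifies numerically that $H(Y\mid M)-H(Y\mid S)=I(Y;S\mid M)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p_ys = rng.dirichlet(np.ones(8)).reshape(2, 4)  # joint p(y, s): Y binary, S 4-valued
f = np.array([0, 0, 1, 1])                      # compression M = f(S)

def cond_entropy(p_yx):
    """H(Y | X) in nats for a joint table with Y on axis 0, X on axis 1."""
    p_x = p_yx.sum(axis=0)
    h = 0.0
    for x in range(p_yx.shape[1]):
        if p_x[x] > 0:
            q = p_yx[:, x] / p_x[x]
            q = q[q > 0]
            h -= p_x[x] * float(np.sum(q * np.log(q)))
    return h

def mutual_info(p_yx):
    """I(Y; X) in nats for a joint table."""
    p_y = p_yx.sum(axis=1, keepdims=True)
    p_x = p_yx.sum(axis=0, keepdims=True)
    mask = p_yx > 0
    return float(np.sum(p_yx[mask] * np.log((p_yx / (p_y * p_x))[mask])))

# Joint of (Y, M): merge the signal cells mapped to the same message.
p_ym = np.zeros((2, 2))
for s in range(4):
    p_ym[:, f[s]] += p_ys[:, s]

# Log-loss value gap of the compressed experiment: H(Y|M) - H(Y|S).
gap = cond_entropy(p_ym) - cond_entropy(p_ys)

# Conditional mutual information I(Y; S | M) = sum_m p(m) * I(Y; S | M = m).
p_m = p_ym.sum(axis=0)
i_ys_given_m = sum(p_m[m] * mutual_info(p_ys[:, f == m] / p_m[m]) for m in range(2))

assert np.isclose(gap, i_ys_given_m)
print(f"log-loss value gap = {gap:.4f} nats = I(Y; S | M)")
```

The equality holds because $M$ is a function of $S$, so $H(Y\mid S,M)=H(Y\mid S)$; the gap is always nonnegative, matching the domination result.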
Paper Structure

This paper contains 16 sections, 10 theorems, 26 equations, 5 figures, and 3 tables.

Key Result

Lemma 1

Let $H$ be any information state and let $\ell:\mathcal{A}\times\mathcal{Y}\to\mathbb{R}_+$ be bounded. Define $\Lambda(h,a) :=\mathbb{E}[\ell(a,Y)\mid H=h]$. Then the Bayes risk $V(H;\ell)$ satisfies $V(H;\ell)=\mathbb{E}\big[\inf_{a\in\mathcal{A}}\Lambda(H,a)\big]$. Moreover, for every $\varepsilon>0$, there exists an $H$-measurable selector $\delta_\varepsilon$ such that $\Lambda(H,\delta_\varepsilon(H)) \le \inf_{a\in\mathcal{A}}\Lambda(H,a)+\varepsilon$ almost surely.
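In the finite case the lemma can be verified directly. The sketch below (not code from the paper; all distributions and the loss table are illustrative) computes $\Lambda(h,a)$ on a small discrete example, evaluates the Bayes risk as the expected pointwise minimum, and checks that every selector $\delta: H \to \mathcal{A}$ is weakly dominated:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_h, n_a, n_y = 3, 4, 2
p_h = rng.dirichlet(np.ones(n_h))                    # law of H
p_y_given_h = rng.dirichlet(np.ones(n_y), size=n_h)  # posterior of Y given H = h
loss = rng.random((n_a, n_y))                        # bounded loss ell(a, y)

# Lambda(h, a) = E[ell(a, Y) | H = h]
Lam = p_y_given_h @ loss.T                           # shape (n_h, n_a)

# Bayes risk: expectation of the pointwise minimum over actions.
bayes_risk = float(p_h @ Lam.min(axis=1))

# Every selector delta: H -> A has weakly larger risk; the pointwise
# argmin selector attains the Bayes risk exactly (epsilon = 0 here,
# since the action set is finite).
for delta in product(range(n_a), repeat=n_h):
    risk = float(p_h @ Lam[np.arange(n_h), list(delta)])
    assert risk >= bayes_risk - 1e-12
```

With finitely many actions the infimum is attained, so the $\varepsilon$-selector of the lemma reduces to the exact argmin rule.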

Figures (5)

  • Figure 1: Accuracy versus relay length on MMLU (200 questions). A single agent achieves ${\sim}90\%$ on both models. Each additional communication stage degrades accuracy. At five agents, performance falls below the $25\%$ chance baseline.
  • Figure 2: Per-question posterior distortion versus accuracy drop relative to the single-agent baseline. Each point is one MMLU question (averaged across runs). Questions with larger KL divergence suffer larger accuracy losses.
  • Figure 3: Accuracy versus communication stage for posterior vector relay and natural-language prose relay (50 MMLU questions, $n{=}750$ per condition). The posterior interface degrades much more slowly.
  • Figure 4: Redundant versus non-redundant tool access (Corollary \ref{cor:verification}). Both groups use the same single-agent architecture with optional tool lookup. Left (MMLU): the model's parametric knowledge already covers the questions, so the tool adds little ($+3.8$ pts). Right (Synthetic KB): 200 fictional entities unknown to the model. Without the tool, the model is near chance ($24.3\%$); with the tool, accuracy reaches $82.7\%$ ($+58.4$ pts).
  • Figure 5: Information flow for three representative conditions. (A) centralized, where a single agent observes $B$ directly. (B) serial relay, where each stage sees only the previous message. (C) tool-augmented, where the agent acquires an external signal $Z$. Prompt templates are shown in Boxes \ref{box:condA}-\ref{box:condS}.
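The degradation mechanism behind Figures 1-3 can be illustrated with a toy simulation (an assumption-laden sketch, not the paper's experiment): each relay stage perturbs the log-posterior before passing it on, and a noisier interface loses the argmax answer faster. The two noise levels below are arbitrary stand-ins for the posterior-vector and prose interfaces:

```python
import numpy as np

rng = np.random.default_rng(42)
n_q, n_ans, n_stages = 2000, 4, 5

def relay(post, noise):
    """One communication stage: perturb the log-posterior, renormalize."""
    logits = np.log(np.clip(post, 1e-12, None)) + rng.normal(0.0, noise, size=post.shape)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

true_post = rng.dirichlet(np.ones(n_ans) * 0.5, size=n_q)  # peaked per-question posteriors
answer = true_post.argmax(axis=1)

results = {}
for label, noise in [("posterior-vector relay", 0.3), ("prose relay", 1.5)]:
    post, accs = true_post, []
    for _ in range(n_stages):
        post = relay(post, noise)
        accs.append(float((post.argmax(axis=1) == answer).mean()))
    results[label] = accs
    print(label, [round(a, 2) for a in accs])
```

Distortion compounds multiplicatively across stages, so accuracy decays monotonically in expectation, and the coarser (higher-noise) interface decays faster — qualitatively matching the relay curves in Figures 1 and 3.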

Theorems & Definitions (16)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Proposition 6
  • Theorem 7
  • Remark 1
  • Remark 2
  • Remark 3
  • ...and 6 more