On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems

Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong

TL;DR

This paper asks when LLM-based multi-agent systems (LLM-MAS) outperform single-agent baselines, arguing that task complexity, captured as depth (sequential reasoning length) and width (breadth of required capabilities), is fundamental. It introduces a theoretical framework with per-step success $s(w)=q^w$, leading to $S_{\text{single}}(d,w)=[s(w)]^{d}$ for a single agent and $S_{\text{multi}}(d,w,N,r)= r[1-(1-s(w))^{N}]^{d}$ for a multi-agent setup with $N$ agents and aggregator reliability $r$, and defines the gain $\Delta(d,w,N,r)=\frac{S_{\text{multi}}-S_{\text{single}}}{S_{\text{single}}}$. Both depth and width increase the LLM-MAS gain, but the gain grows without bound in depth and saturates in width. The authors prove $\partial \Delta/\partial d>0$ and $\partial \Delta/\partial w>0$, show $\lim_{w\to\infty}\Delta=(rN)^d-1$ and $\lim_{d\to\infty}\Delta=+\infty$, and validate these predictions empirically on math reasoning (DyVal) and creative writing (DW$^2$). They also introduce the DW$^2$ benchmark to quantify depth and width in generative tasks and perform a Shapley-$R^2$ (S-Score) analysis to quantify the influence of each dimension. The work provides principled guidance for designing LLM-MAS benchmarks and adaptive, task-aware multi-agent systems, suggesting that tasks with deep reasoning demands are particularly well suited to collaborative approaches.
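The formulas above can be evaluated numerically to see how the gain responds to depth and width. The following is a minimal sketch that transcribes the expressions exactly as written in this TL;DR; the function names and the parameter values (`q = 0.9`, `N = 3`, `r = 0.9`) are illustrative assumptions, not taken from the paper.

```python
def step_success(w: int, q: float) -> float:
    """Per-step success s(w) = q^w for a step requiring w capabilities."""
    return q ** w


def single_success(d: int, w: int, q: float) -> float:
    """S_single(d, w) = [s(w)]^d: one agent must succeed at all d steps."""
    return step_success(w, q) ** d


def multi_success(d: int, w: int, n_agents: int, r: float, q: float) -> float:
    """S_multi(d, w, N, r) = r * [1 - (1 - s(w))^N]^d, as written in the TL;DR."""
    s = step_success(w, q)
    return r * (1.0 - (1.0 - s) ** n_agents) ** d


def gain(d: int, w: int, n_agents: int, r: float, q: float) -> float:
    """Relative gain Delta = (S_multi - S_single) / S_single."""
    single = single_success(d, w, q)
    return (multi_success(d, w, n_agents, r, q) - single) / single


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    q, n_agents, r = 0.9, 3, 0.9
    for d in (1, 2, 4, 8):
        row = [f"{gain(d, w, n_agents, r, q):10.3f}" for w in (1, 5, 20, 60)]
        print(f"d={d}:", *row)
```

Under these assumed values, each printed row flattens as `w` grows, while each column keeps growing as `d` grows, which matches the qualitative claim of saturation in width and unbounded growth in depth.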

Abstract

Large language model multi-agent systems (LLM-MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM-MAS can outperform LLM single-agent systems (LLM-SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM-MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM-MAS, namely the multi-agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM-MAS over LLM-SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM-MAS are beneficial and provides a principled foundation for designing future LLM-MAS methods and benchmarks.

Paper Structure

This paper contains 13 sections, 2 theorems, 13 equations, 6 figures.

Key Result

Proposition 2.1

Let $\mathcal{T} = \{s(w)\}_{t=1}^d$ denote a given task. According to Definition 2.4, we have $\Delta(d,w,N,r) \;\triangleq\; \frac{S_{\mathrm{multi}}-S_{\mathrm{single}}}{S_{\mathrm{single}}}.$ Then, we have $\frac{\partial \Delta}{\partial d}\;>\;0$ and $\frac{\partial \Delta}{\partial w}\;>\;0$.
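
One way to see why both derivatives are positive is sketched below. It uses the expressions for $S_{\text{single}}$ and $S_{\text{multi}}$ as written in the TL;DR and assumes $0<q<1$, $N\ge 2$, $r>0$, and that $d$ and $w$ are treated as continuous. The shorthand $\rho(s)$ is introduced here for illustration, and the paper's own proof may proceed differently.

$$
\begin{aligned}
\Delta + 1 &= \frac{S_{\text{multi}}}{S_{\text{single}}} = r\,\rho(s)^{d}, \qquad \rho(s) \triangleq \frac{1-(1-s)^{N}}{s} = \sum_{k=0}^{N-1}(1-s)^{k},\\
\frac{\partial \Delta}{\partial d} &= r\,\rho^{d}\ln\rho \;>\; 0 \quad \text{since } \rho > 1 \text{ for } 0<s<1,\\
\frac{\partial \Delta}{\partial w} &= r\,d\,\rho^{d-1}\,\rho'(s)\,s'(w) \;>\; 0 \quad \text{since } \rho'(s)<0 \text{ and } s'(w)=q^{w}\ln q<0.
\end{aligned}
$$

The geometric-sum form of $\rho$ makes both sign claims immediate: every term $(1-s)^k$ is positive and decreasing in $s$, so $\rho$ exceeds 1 and shrinks as the per-step success rate rises.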

Figures (6)

  • Figure 1: Left: math reasoning and creative writing tasks with controllable complexity in terms of width and depth. Right: Exemplar LLM-SAS and LLM-MAS framework. For simplicity, the input question to each agent is omitted from the presentation starting with the second debate turn.
  • Figure 2: Visualization of task complexity defined by depth and width. The pipeline represents one round of multi-agent debate. "Agg" stands for aggregator.
  • Figure 3: Results on the math reasoning benchmark.
  • Figure 4: Results on the creative writing benchmark.
  • Figure 5: The accuracy of LLM-SAS and LLM-MAS on math reasoning.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Definition 2.1: The task defined by depth and width
  • Definition 2.2: The success rate of LLM-SAS
  • Definition 2.3: The success rate of LLM-MAS
  • Definition 2.4: LLM-MAS Performance Gain over LLM-SAS
  • Proposition 2.1: Increase of LLM-MAS Performance Gain with Depth $d$ and Width $w$
  • Proposition 2.2: Unbounded Growth in Depth vs. Finite Saturation in Width
  • proof
  • proof