On the Importance of Task Complexity in Evaluating LLM-Based Multi-Agent Systems
Bohan Tang, Huidong Liang, Keyue Jiang, Xiaowen Dong
TL;DR
This paper asks when LLM-based multi-agent systems (LLM-MAS) outperform single-agent baselines and argues that task complexity, captured as depth (sequential reasoning length) and width (breadth of required capabilities), is fundamental to the answer. It introduces a theoretical framework with per-step success $s(w)=q^w$, which gives $S_{\text{single}}(d,w)=[s(w)]^{d}$ for a single agent and $S_{\text{multi}}(d,w,N,r)=\left[r\left(1-(1-s(w))^{N}\right)\right]^{d}$ for a multi-agent setup with $N$ agents and aggregator reliability $r$. The relative gain is defined as $\Delta(d,w,N,r)=\frac{S_{\text{multi}}-S_{\text{single}}}{S_{\text{single}}}$. The authors prove $\partial \Delta/\partial d>0$ and $\partial \Delta/\partial w>0$, and show that $\lim_{w\to\infty}\Delta=(rN)^d-1$ while $\lim_{d\to\infty}\Delta=+\infty$: both depth and width increase the gain of LLM-MAS, but the gain saturates in width and grows without bound in depth. These predictions are validated empirically on math reasoning (DyVal) and creative writing (DW$^2$), where DW$^2$ is a new benchmark introduced to quantify depth and width in generative tasks, and a Shapley-$R^2$ (S-Score) analysis quantifies the influence of each dimension. The work offers principled guidance for designing LLM-MAS benchmarks and adaptive, task-aware multi-agent systems, and suggests that tasks demanding deep reasoning are particularly well suited to collaborative approaches.
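The formulas in this summary can be checked numerically with a minimal sketch such as the one below. It is an illustration and not the authors' code. It assumes that the aggregator reliability $r$ applies at every step, which is the reading consistent with the stated limit $(rN)^d-1$, and that $q$ lies strictly between 0 and 1 with $w>0$. The function names `step_success` and `relative_gain` are hypothetical.

```python
import math


def step_success(w: float, q: float) -> float:
    """Per-step success of a single agent, s(w) = q**w."""
    return q ** w


def relative_gain(d: int, w: float, N: int, r: float, q: float) -> float:
    """Relative gain Delta = S_multi / S_single - 1.

    Assumption: r multiplies the success probability at every step, so
    S_multi = (r * (1 - (1 - s)**N))**d and S_single = s**d.
    Requires 0 < q < 1 and w > 0 so that 0 < s < 1.
    """
    s = step_success(w, q)
    # 1 - (1 - s)**N, computed via log1p/expm1 to stay accurate when s is tiny.
    any_agent_correct = -math.expm1(N * math.log1p(-s))
    # S_multi / S_single factorises into a per-step ratio raised to the power d.
    per_step_ratio = r * any_agent_correct / s
    return per_step_ratio ** d - 1.0


if __name__ == "__main__":
    N, r, q = 3, 0.9, 0.9  # illustrative values, not taken from the paper
    for w in (1, 2, 5, 20, 200):
        print(f"d=3, w={w:>3}: gain = {relative_gain(3, w, N, r, q):.4f}")
    print(f"width limit (rN)^d - 1 = {(r * N) ** 3 - 1:.4f}")
    for d in (1, 2, 4, 8, 16):
        print(f"w=2, d={d:>2}: gain = {relative_gain(d, 2, N, r, q):.4f}")
```

Under these assumptions, the gain approaches $(rN)^d-1$ as $w$ grows and keeps increasing with $d$ whenever the per-step ratio $r\,(1-(1-s)^N)/s$ exceeds 1. When that ratio falls below 1, for example at small $w$ with an imperfect aggregator, the sketch returns a negative gain.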
Abstract
Large language model multi-agent systems (LLM-MAS) offer a promising paradigm for harnessing collective intelligence to achieve more advanced forms of AI behaviour. While recent studies suggest that LLM-MAS can outperform LLM single-agent systems (LLM-SAS) on certain tasks, the lack of systematic experimental designs limits the strength and generality of these conclusions. We argue that a principled understanding of task complexity, such as the degree of sequential reasoning required and the breadth of capabilities involved, is essential for assessing the effectiveness of LLM-MAS in task solving. To this end, we propose a theoretical framework characterising tasks along two dimensions: depth, representing reasoning length, and width, representing capability diversity. We theoretically examine a representative class of LLM-MAS, namely the multi-agent debate system, and empirically evaluate its performance in both discriminative and generative tasks with varying depth and width. Theoretical and empirical results show that the benefit of LLM-MAS over LLM-SAS increases with both task depth and width, and the effect is more pronounced with respect to depth. This clarifies when LLM-MAS are beneficial and provides a principled foundation for designing future LLM-MAS methods and benchmarks.
