Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang

TL;DR

This paper introduces Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels. Its results show that naively scaling agent count cannot circumvent context limitations, and the benchmark provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.

Abstract

Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information -- rather than merely exchange it -- remains an open question. We introduce Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage -- agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.

Paper Structure (62 sections, 5 equations, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Pipeline of Silo-Bench. Global information is partitioned across $N$ agents, each holding only local data. Agents must communicate through the provided protocol to reconstruct global truth. Success requires effective collaboration strategies. The example shown is task III-21, Distributed Sort (Appendix \ref{app:tasks}).
  • Figure 2: Three complexity levels in Silo-Bench characterized by their communication patterns. Level I (Aggregation): A central agent collects data from all peers via a star topology. Level II (Mesh Network): Agents exchange information with immediate neighbors through pairwise communication. Level III (Global Shuffle): All agents must communicate with every other agent, requiring full mesh connectivity.
  • Figure 3: The three communication protocols employed in Silo-Bench.
  • Figure 4: Scaling behavior across agent counts. (a) Success rates decline for all models as team size increases, with sharp drops beyond $N=20$. (b) Token consumption scales roughly linearly with agent count. (c) Communication density decreases at scale, suggesting coordination sparsification.
  • Figure 5: Success rate by difficulty level.
  • ...and 2 more figures
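
The setup in Figures 1 and 2 can be illustrated with a small sketch. This is not the benchmark's implementation: the function names, the contiguous sharding scheme, and the choice of a chain for the Level II neighbor structure are all assumptions made for illustration. It only shows how global data might be split into local shards and which agent pairs would need a link under each of the three communication patterns.

```python
from itertools import combinations


def partition(data, n):
    """Split global data into n contiguous local shards (sizes differ by at most 1).

    Hypothetical helper: the sharding scheme used by Silo-Bench is not given here.
    """
    base, extra = divmod(len(data), n)
    shards, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        shards.append(data[start:start + size])
        start += size
    return shards


def star_edges(n, hub=0):
    """Level I (Aggregation): every peer links to one central agent."""
    return {(min(hub, i), max(hub, i)) for i in range(n) if i != hub}


def neighbor_edges(n):
    """Level II: each agent links only to immediate neighbors.

    A simple chain is assumed; the caption does not fix the exact layout.
    """
    return {(i, i + 1) for i in range(n - 1)}


def full_mesh_edges(n):
    """Level III (Global Shuffle): every pair of agents is linked."""
    return set(combinations(range(n), 2))


if __name__ == "__main__":
    n = 5
    print(partition(list(range(7)), 3))
    for name, fn in [("star", star_edges), ("chain", neighbor_edges), ("mesh", full_mesh_edges)]:
        print(name, len(fn(n)), "links")
```

Under these assumptions, the star and chain layouts need N - 1 links while the full mesh needs N(N - 1)/2, which gives a rough sense of why Level III tasks demand far more communication as the team grows.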