AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs

Florian Grötschla, Luis Müller, Jan Tönshoff, Mikhail Galkin, Bryan Perozzi

TL;DR

AgentsNet introduces a scalable benchmark for multi-agent LLM systems, grounded in graph topology and built on five foundational distributed-computing problems: Coloring, VertexCover, Maximal Matching, Leader Election, and Consensus. By mapping these tasks to decentralized agent coordination and implementing a robust message-passing protocol inspired by the LOCAL model, the paper evaluates a broad suite of models (including frontier LLMs) on graphs of 4, 8, and 16 agents, and on graphs of up to 100 agents, drawn from SmallWorld, ScaleFree, and Delaunay topologies. Key findings show that while some frontier models perform well on small networks, performance degrades as network size grows, underscoring scalability and coordination challenges in decentralized reasoning. The work provides open-source code and datasets, demonstrates meaningful model differentiation beyond simple benchmarks, and advocates for future protocol innovations and heterogeneous-agent setups to improve scalability and robustness in distributed multi-agent systems.

Abstract

Large language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several questions about the ability of a complex network of agents to effectively self-organize and collaborate. While measuring performance on standard reasoning benchmarks indicates how well multi-agent systems can solve reasoning tasks, it is unclear whether these systems are able to leverage their topology effectively. Here, we propose AgentsNet, a new benchmark for multi-agent reasoning. By drawing inspiration from classical problems in distributed systems and graph theory, AgentsNet measures the ability of multi-agent systems to collaboratively form strategies for problem-solving, self-organization, and effective communication given a network topology. We evaluate a variety of baseline methods on AgentsNet, including homogeneous networks of agents which first have to agree on basic protocols for organization and communication. We find that some frontier LLMs already demonstrate strong performance for small networks but begin to fall off once the size of the network scales. While existing multi-agent benchmarks cover at most 2-5 agents, AgentsNet is practically unlimited in size and can scale with new generations of LLMs. As such, we also probe frontier models in a setup with up to 100 agents.

Paper Structure

This paper contains 50 sections, 11 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Mean AgentsNet score of models versus API costs per repeat (May 15, 2025). Error bars indicate standard error of the mean. Gold stars denote Pareto-optimal models.
  • Figure 2: Example communication between three agents on a simplified topology. Agents Emily, Zach, and Tom each receive and send messages to their neighbors in multiple rounds of message-passing; see the qualitative-analysis appendix for an in-depth analysis of transcripts.
  • Figure 3: Overview of the tasks in AgentsNet: In LeaderElection, the task is to select a single agent as the leader of the network. In Consensus, the task is for all agents to agree on a specific value, for example $0$ or $1$. In Matching, the task is for pairs of agents to team up without conflicts. In Coloring, the task is for agents to select a group (indicated by a color), such that none of their neighbors are in the same group as them. In VertexCover, the task is to find a minimal group of coordinator agents such that each agent is a neighbor to at least one coordinator.
  • Figure 4: Fraction of solved instances per task and model, grouped by graph size (4, 8, and 16 nodes). Each task contributes up to 20% to the total, as instances are equally distributed across the five benchmark tasks. Reasoning and non-reasoning models are visually separated. This breakdown complements Figure 1 by providing a more granular view of task-level performance.
  • Figure 5: Scalability of Gemini 2.0 Flash on AgentsNet: Average fraction of successfully solved instances per task as the graph size increases from 20 to 100 agents.
  • ...and 3 more figures
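
The task descriptions in the Figure 3 caption can be read as simple validity checks on a graph. The following is a minimal sketch under stated assumptions, not the benchmark's own scoring code: the adjacency-set graph representation and all function names are hypothetical, the vertex cover check uses the standard textbook definition (every edge has at least one endpoint among the coordinators), and minimality of the cover is not checked.

```python
from typing import Dict, Hashable, Optional, Set

# Hypothetical representation: undirected graph as node -> set of neighbors.
Graph = Dict[Hashable, Set[Hashable]]


def is_valid_coloring(graph: Graph, color: Dict[Hashable, str]) -> bool:
    """Coloring: no agent shares a group (color) with any neighbor."""
    return all(color[u] != color[v] for u in graph for v in graph[u])


def is_valid_matching(graph: Graph, partner: Dict[Hashable, Optional[Hashable]]) -> bool:
    """Matching: every chosen partner is a neighbor and reciprocates."""
    for u, v in partner.items():
        if v is None:
            continue  # unmatched agents are allowed
        if v not in graph[u] or partner.get(v) != u:
            return False
    return True


def is_maximal_matching(graph: Graph, partner: Dict[Hashable, Optional[Hashable]]) -> bool:
    """Maximal: valid, and no two adjacent agents are both unmatched."""
    if not is_valid_matching(graph, partner):
        return False
    return not any(
        partner.get(u) is None and partner.get(v) is None
        for u in graph
        for v in graph[u]
    )


def is_vertex_cover(graph: Graph, coordinators: Set[Hashable]) -> bool:
    """Assumed standard definition: every edge touches a coordinator."""
    return all(u in coordinators or v in coordinators for u in graph for v in graph[u])


def has_single_leader(is_leader: Dict[Hashable, bool]) -> bool:
    """LeaderElection: exactly one agent declares itself leader."""
    return sum(bool(flag) for flag in is_leader.values()) == 1


def has_consensus(value: Dict[Hashable, int]) -> bool:
    """Consensus: all agents report the same value."""
    return len(set(value.values())) == 1


# Three-agent path graph, reusing the agent names from the Figure 2 caption.
path = {"Emily": {"Zach"}, "Zach": {"Emily", "Tom"}, "Tom": {"Zach"}}
```

On this path graph, for instance, giving Emily and Tom one color and Zach another passes `is_valid_coloring`, and choosing Zach as the only coordinator passes `is_vertex_cover`, since both edges touch Zach.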