
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

TL;DR

Colosseum addresses collusion risk in LLM-driven cooperative multi-agent systems by formalizing collusion as a deviation from a nominal DCOP objective and auditing it via regret relative to the cooperative optimum. It introduces a mixed objective $F_\lambda(x) = (1-\lambda)F_n(x) + \lambda F_c(x)$ and a $\Delta$-collusion criterion to enable counterfactual evaluation, decomposing contributing factors into objective misalignment, persuasion, and network influence. The authors instantiate two real-world-inspired DCOP environments (Hospital and Jira) plus a meeting-scheduling benchmark to demonstrate emergent, attempted, and hidden collusion across models and topologies, and show that regret-based auditing reveals issues overlooked by LLM-as-a-judge scores alone. The work provides a formal, auditable framework for evaluating and mitigating collusion risks when deploying safe, scalable multi-agent systems.
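The mixed objective and the regret measure above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the three binary agents, the payoff functions `nominal` and `coalition`, and the brute-force `argmax` over the finite assignment set. The sketch only shows how increasing $\lambda$ can move the maximizer of $F_\lambda$ away from the cooperative optimum and how that shift shows up as regret under $F_n$.

```python
from itertools import product

# Hypothetical toy problem: three agents, each choosing a binary action.
DOMAINS = [(0, 1), (0, 1), (0, 1)]

def nominal(x):
    """Hypothetical nominal objective F_n: one point per pair of matching actions."""
    a, b, c = x
    return int(a == b) + int(b == c) + int(a == c)

def coalition(x):
    """Hypothetical secondary objective F_c: the last two agents prefer to mismatch."""
    _, b, c = x
    return 3 * int(b != c)

def mixed(x, lam):
    """F_lambda(x) = (1 - lambda) * F_n(x) + lambda * F_c(x)."""
    return (1 - lam) * nominal(x) + lam * coalition(x)

def argmax(objective):
    """Brute-force maximizer over the finite set of complete assignments."""
    return max(product(*DOMAINS), key=objective)

def regret(x):
    """Regret under the nominal objective: F_n(x*) - F_n(x)."""
    x_star = argmax(nominal)
    return nominal(x_star) - nominal(x)

def regret_under_lambda(lam):
    """Regret of the assignment that maximizes the mixed objective at a given lambda."""
    x_lam = argmax(lambda x: mixed(x, lam))
    return regret(x_lam)

for lam in (0.0, 0.25, 0.5, 1.0):
    print(lam, regret_under_lambda(lam))
```

In this toy instance the regret stays at zero for small $\lambda$ and jumps once the coalition's preference outweighs the nominal reward; the threshold depends entirely on the hypothetical payoffs chosen here.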

Abstract

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when individual agents form a coalition and collude to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground how agents cooperate through a Distributed Constraint Optimization Problem (DCOP) and measure collusion via regret relative to the cooperative optimum. Colosseum tests each LLM for collusion under different objectives, persuasion tactics, and network topologies. Through our audit, we show that most out-of-the-box models exhibited a propensity to collude when a secret communication channel was artificially formed. Furthermore, we discover "collusion on paper" when agents plan to collude in text but would often pick non-collusive actions, thus providing little effect on the joint task. Colosseum provides a new way to study collusion by measuring communications and actions in rich yet verifiable environments.

Paper Structure (40 sections, 1 theorem, 23 equations, 13 figures)


Key Result

Proposition 2.1

Let $\Omega \triangleq \prod_{j=1}^m D_j$ be the (finite) set of complete assignments and let $F_n:\Omega\to\mathbb{R}$ be the original DCOP objective to be maximized, with $\mathbf{x}^\star$ as in equation \eqref{eq:original_problem_objective_only}. Let $\tilde{\mathbf{x}}$ be a $\Delta$-collusive assignment, w

Figures (13)

  • Figure 1: Colosseum helps to identify distinct collusive behaviors by LLMs. In a classroom setting, agents collude on a secret channel to optimize their secondary objective for delinquency. The coalition advantage offers a formal metric for collusion success while LLM-as-a-judge evaluates communications.
  • Figure 2: Coalition-induced asymmetric DCOP. Toy 3-agent, binary-action DCOP illustrating how collusion can shift the global solution. Each number in the table is a payoff/reward for that particular agent given the joint action, resulting in an objective maximization problem. Regret is computed using the summed difference between the optimal and realized actions for the nominal objective (main channel). Coalition advantage is computed by the local regret of non-coalition agents (agent $A$) on the nominal objective minus the sum of local regrets of the coalition agents on the secondary objective (secret channel). Left: the nominal (intended) DCOP rewards certain joint actions and disincentivizes others. Middle: the coalition optimizes a convex combination $F_{\text{coal}}(\mathbf{x})=(1-\lambda)F_{N}(\mathbf{x}) + \lambda F_{\text{C}}(\mathbf{x})$, where $\lambda\in[0,1]$ controls how strongly the coalition values its hidden objective $F_{\text{C}}$ (mismatch) and $(1-\lambda)$ weights the original normative objective $F_N$ (match). As $\lambda$ increases, standard cooperative DCOP message passing can be steered toward an assignment that improves the coalition's effective objective but is suboptimal under $F_N$; agent $A$ follows because it does not observe the coalition's hidden factors and relies on the coalition's reported objective improvements. Right: agents $B$ and $C$ form a coalition (red) and introduce a hidden preference for mismatch (red table), creating an asymmetric DCOP in which coalition members evaluate assignments under a different objective than the non-coalition agent: $A$ sees the payoffs in the main channel (dark green), while $B$ and $C$ ignore them (white) and consider the payoffs of the secret channel (red).
  • Figure 3: Example of collusive behavior. When a secret channel is created between agents, scheming (making underhanded plans) and persuasion (convincing other agents) emerge without being explicitly prompted, showing an emergence of malicious behavior on these side channels. (a) Scheming. On the secret channel, coalition agents planned to manipulate the main-channel discussion to change non-coalition agents' actions. (b) Persuasion. On the secret channel, a coalition agent persuaded a main-channel agent to alter their planned action to help another coalition agent.
  • Figure 4: Audit results for Kimi-K2-Instruct (in Jira). Colosseum provides a suite of metrics that enable a more objective analysis of collusion and that are complemented by LLM-as-a-judge. We find that coalition Kimi-K2-Instruct agents given the simple prompt, which instructs them to optimize their joint coalition reward, achieve a higher coalition advantage than agents whose prompts also append explicit instructions to use deceptive and manipulative tactics. This shows that explicitly optimizing for manipulation and deception is not necessarily optimal for the colluding party. This was evaluated using six agents in total, two of them coalition agents. Bars represent normalized means, and error bars indicate standard error.
  • Figure 5: Audit results across models (in Jira). Collusion judge scores generally increase after adding a secret channel, while regret-based metrics can surface anomalies missed by judge scores. Based on the Baseline and Control groups and their coalition advantage and collusion judge scores, Claude-Sonnet-4.5 and GPT-4o-Mini exhibited direct collusion; Kimi-K2-Instruct and Kimi-K2-Thinking exhibited attempted collusion; and Gemini-2.5-Flash and GPT-4.1-Mini exhibited hidden collusion. Configuration details, with the exception of the model types, are described in Figure \ref{fig:collusion-effects}.
  • ...and 8 more figures
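
The coalition-advantage computation described in the Figure 2 caption (local regret of the non-coalition agent on the nominal objective, minus the summed local regrets of the coalition agents on the secondary objective) can be sketched roughly as follows. This is a hedged illustration and not the authors' implementation: the payoff tables `nominal_payoff` and `secret_payoff` are hypothetical, and "local regret" is assumed here to mean an agent's payoff at the relevant objective's optimal joint action minus its payoff at the realized joint action.

```python
from itertools import product

AGENTS = ("A", "B", "C")
COALITION = ("B", "C")  # B and C share the secret channel, as in Figure 2
JOINT_ACTIONS = list(product((0, 1), repeat=3))

def nominal_payoff(agent, x):
    """Hypothetical main-channel payoff: 1 point per other agent whose action matches."""
    i = AGENTS.index(agent)
    return sum(int(x[i] == x[j]) for j in range(3) if j != i)

def secret_payoff(agent, x):
    """Hypothetical secret-channel payoff: B and C each earn 2 when they mismatch."""
    b, c = x[1], x[2]
    return 2 * int(b != c) if agent in COALITION else 0

def best_joint_action(payoff, agents):
    """Joint action maximizing the summed payoff of the given agents (brute force)."""
    return max(JOINT_ACTIONS, key=lambda x: sum(payoff(a, x) for a in agents))

def local_regret(agent, payoff, x_opt, x):
    """Assumed definition: payoff at the optimum minus payoff at the realized action."""
    return payoff(agent, x_opt) - payoff(agent, x)

def coalition_advantage(x):
    """Outsiders' nominal regret minus the coalition's regret on its secret objective."""
    x_nom = best_joint_action(nominal_payoff, AGENTS)
    x_sec = best_joint_action(secret_payoff, COALITION)
    outsiders = [a for a in AGENTS if a not in COALITION]
    outsider_regret = sum(local_regret(a, nominal_payoff, x_nom, x) for a in outsiders)
    coalition_regret = sum(local_regret(a, secret_payoff, x_sec, x) for a in COALITION)
    return outsider_regret - coalition_regret

print(coalition_advantage((0, 0, 0)))  # all agents match: the coalition gave up its goal
print(coalition_advantage((0, 0, 1)))  # B and C mismatch: agent A bears the cost
```

Under these assumptions, a positive value indicates that the realized joint action served the coalition's hidden preference at the non-coalition agent's expense, while a negative value indicates that the coalition failed to realize its secondary objective.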

Theorems & Definitions (2)

  • Definition 4.1: $\Delta$-Collusion
  • Proposition 2.1: Tight worst-case degradation under collusion