Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Chandler Smith, Marwa Abdulhai, Manfred Diaz, Marko Tesic, Rakshit S. Trivedi, Alexander Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar A. Duéñez-Guzmán, John P. Agapiou, Jayd Matyas, Danny Karmon, Akash Kundu, Aliaksei Korshuk, Ananya Ananya, Arrasy Rahman, Avinaash Anand Kulandaivel, Bain McHale, Beining Zhang, Buyantuev Alexander, Carlos Saith Rodriguez Rojas, Caroline Wang, Chetan Talele, Chenao Liu, Chichen Lin, Diana Riazi, Di Yang Shi, Emanuel Tewolde, Elizaveta Tennant, Fangwei Zhong, Fuyang Cui, Gang Zhao, Gema Parreño Piqueras, Hyeonggeun Yun, Ilya Makarov, Jiaxun Cui, Jebish Purbey, Jim Dilkes, Jord Nguyen, Lingyun Xiao, Luis Felipe Giraldo, Manuela Chacon-Chamorro, Manuel Sebastian Rios Beltran, Marta Emili García Segura, Mengmeng Wang, Mogtaba Alim, Nicanor Quijano, Nico Schiavone, Olivia Macmillan-Scott, Oswaldo Peña, Peter Stone, Ram Mohan Rao Kadiyala, Rolando Fernandez, Ruben Manrique, Sunjia Lu, Sheila A. McIlraith, Shamika Dhuri, Shuqing Shi, Siddhant Gupta, Sneheel Sarangi, Sriram Ganapathi Subramanian, Taehun Cha, Toryn Q. Klassen, Wenming Tu, Weijian Fan, Wu Ruiyang, Xue Feng, Yali Du, Yang Liu, Yiding Wang, Yipeng Kang, Yoonchang Sung, Yuxuan Chen, Zhaowei Zhang, Zhihan Wang, Zhiqiang Wu, Ziang Chen, Zilong Zheng, Zixia Jia, Ziyan Wang, Dylan Hadfield-Menell, Natasha Jaques, Tim Baarslag, Jose Hernandez-Orallo, Joel Z. Leibo

TL;DR

This work uses Concordia, a natural-language multi-agent simulation platform, to evaluate how well the cooperative abilities of LLM-based agents generalize in mixed-motive scenarios. It introduces five cooperation-eliciting substrates and a veil-of-ignorance evaluation protocol that tests zero-shot generalization to unfamiliar partners. Empirical results from the NeurIPS 2024 Concordia Contest reveal meaningful gaps in current agent capabilities, especially in persuasion and norm enforcement, and show that multiple ranking approaches (Elo, Iterative Maximal Lotteries, Copeland, Ranked Pairs) are useful for diagnosing robustness. The study highlights the need for stronger zero-shot coordination and for multi-modal, multi-agent evaluation in future work, with implications for deploying cooperative AI in real-world social contexts.
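To make two of the ranking approaches named above concrete, here is a minimal sketch of a Copeland score and a single Elo update. It is an illustration under stated assumptions, not the contest's implementation: the function names, the `wins` input format, the half-point tie rule, and the Elo constants (`k=32`, scale 400) are conventional choices that do not come from the paper.

```python
from itertools import combinations


def copeland_scores(wins):
    """Copeland scores from pairwise results (one common variant).

    `wins[a][b]` is a hypothetical input: the number of scenarios in which
    agent `a` outscored agent `b`. An agent earns 1 point for each opponent
    it beats head-to-head and 0.5 for each head-to-head tie.
    """
    scores = {agent: 0.0 for agent in wins}
    for a, b in combinations(list(wins), 2):
        a_over_b = wins[a].get(b, 0)
        b_over_a = wins[b].get(a, 0)
        if a_over_b > b_over_a:
            scores[a] += 1.0
        elif b_over_a > a_over_b:
            scores[b] += 1.0
        else:
            scores[a] += 0.5
            scores[b] += 0.5
    return scores


def elo_update(rating_a, rating_b, outcome_a, k=32.0):
    """One Elo update; `outcome_a` is 1.0 (win), 0.5 (draw), or 0.0 (loss).

    k=32 and the 400-point scale are conventional defaults, assumed here.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Copeland depends only on who beats whom, while Elo also reflects the order and strength of opponents faced, which is one reason the two can disagree and why comparing several rankings can be informative about robustness.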

Abstract

Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

Paper Structure

This paper contains 40 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of the 2024 NeurIPS Concordia Contest Framework. Contestants (top left) design and submit agents that, during the Development Phase, interact with background co‑players across five cooperation‑eliciting substrates (Pub Coordination, Haggling, State Formation, Labor Collective Action, Reality Show). Each scenario is run in either Resident or Visitor mode (see the scenario design section) under the orchestration of a Game Master, which mediates action attempts, determines resulting events, issues event statements and observations, and computes the agent's and co‑player's scores. At the point marked by the blue star, contestants submit their final agents to the Evaluation Phase, which proceeds under a "veil of ignorance". Agents are first Elo‑ranked in novel scenarios; the top five agents then engage in a cross‑play round, which determines the overall winner.
  • Figure 2: Posterior distributions (mean and 95% HDI) of agent performance (log-odds difference) relative to the rational agent baseline. Vertical dashed line indicates no difference.
  • Figure 3: Left: Mean score for each scenario (error bars are ±1 SE). Agents performed relatively poorly in most scenarios; in several, however, their average score exceeded 50% of the theoretical maximum, and in one case it approached 90%. Right: Posterior means and 94% highest-density intervals of the tag coefficients from the hierarchical beta-regression with LKJ-correlated priors. Nearly all cooperative tags have negative coefficients, meaning that the presence of these tags lowers agents’ scores. This pattern indicates that agents struggled when scenarios required cooperation, as expected for a cooperation-eliciting benchmark.
  • Figure 4: Mean scores for each focal agent, with error bars indicating ±1 standard error of the mean. Agents are ordered by decreasing mean score.
  • Figure 5: Heatmap of the Pearson correlation coefficients between the score and each tag, substrate, and resident status.
  • ...and 3 more figures