MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Lin Xu; Zhiyuan Hu; Daquan Zhou; Hongyu Ren; Zhen Dong; Kurt Keutzer; See Kiong Ng; Jiashi Feng

TL;DR

This paper introduces MAgIC, a competition-based benchmark that evaluates LLM-powered multi-agent systems on cognition, adaptability, rationality, and collaboration, using seven metrics over five scenarios (two social-deduction games and three game-theory tasks). It pairs a fixed defender with multiple challengers to produce a win-rate-based ranking, and it augments LLMs with Probabilistic Graphical Models (PGMs) to form PGM-aware agents, achieving an average performance uplift of 37%. Across seven evaluated LLMs, GPT-4-turbo leads the leaderboard, underscoring persistent gaps between top and lower-capability models. The work also demonstrates that PGM integration yields consistent improvements and offers a framework for future expansion of multi-agent social-cognition benchmarks.
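
The defender-versus-challenger ranking described above can be illustrated with a minimal sketch. It assumes each game is recorded as a (challenger, won) pair against the same fixed defender; the function name `rank_challengers`, the model labels, and the sample results are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict

def rank_challengers(match_results):
    """Rank challengers by win rate against one fixed defender.

    match_results: iterable of (challenger_name, challenger_won) pairs,
    one pair per game. This record format is an assumption for illustration.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for challenger, won in match_results:
        games[challenger] += 1
        wins[challenger] += int(won)
    # Win rate = games won / games played, per challenger.
    win_rates = {name: wins[name] / games[name] for name in games}
    # Highest win rate first.
    return sorted(win_rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results: every challenger faces the same defender.
results = [
    ("model_a", True), ("model_a", False),
    ("model_b", True), ("model_b", True),
    ("model_c", False), ("model_c", False),
]
leaderboard = rank_challengers(results)
for name, rate in leaderboard:
    print(f"{name}: {rate:.2f}")
```

Holding the defender fixed means every challenger is measured against the same opponent, so the resulting win rates are directly comparable.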

Abstract

Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating exceptional reasoning, tool usage, and memory capabilities. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework that captures LLMs' reasoning, planning, collaboration, and other social abilities. This work introduces a novel competition-based benchmark framework specifically designed to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. We use two social deduction games alongside three game-theory scenarios to create diverse environments. Our framework is fortified with the probabilistic graphical modeling (PGM) method, which enhances the LLMs' capabilities in navigating complex social and cognitive dimensions. We evaluate seven LLMs, quantitatively highlighting a significant capability gap of over threefold between the strongest, GPT o1, and the weakest, Llama-2-70B. The evaluation also confirms that our PGM enhancement boosts the abilities of all selected models by an average of 37%. Our data and code can be found at https://github.com/cathyxl/MAgIC.

Paper Structure (29 sections, 13 equations, 10 figures, 6 tables)

Introduction
Related Work
Benchmark
Scenarios
Competition Settings
Evaluation Metrics
PGM-Aware Agent
PGM Structure
LLM Decision with PGM
Experiments
LLM Leaderboard
PGM Enhancement Performance
Analysis
Discussion: Generalization of Benchmark
Conclusion
...and 14 more sections

Figures (10)

Figure 1: The radar chart depicts LLMs' performance on 7 metrics, with "-T" for "-turbo" and "+P" for "+PGM". The bar chart displays the polygons' areas, and the red line indicates average game-winning rates. Larger areas correlate with higher winning rates, validating the effectiveness of the proposed metrics for assessing LLMs' capabilities. For more information, refer to the Experiments section.
Figure 2: Overview of evaluation setting, scenarios, and proposed metrics.
Figure 3: Decision process of the PGM-aware agent. This example involves an undercover game in which the PGM-aware agent B believes that agent C is the undercover. Consequently, B decides to respond with "It is deep," which better describes the word "cup" than the undercover word "mug".
Figure 4: Comparison between PGM-aware and vanilla agents on seven metrics. Most PGM-aware agents significantly outperform the vanilla ones in 3-4 of the 7 abilities, with p-values lower than 0.05 (t-test).
Figure 5: A case study on the Chameleon game with Llama-2-70B, GPT-4, and their PGM-enhanced versions. The numerical probabilities are calculated by extracting judgments from the text-based PGM and normalizing them to a scale of 0 to 1.
...and 5 more figures
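
The Figure 1 caption refers to the area of each radar-chart polygon. As a hedged illustration, the sketch below computes that area for scores plotted on equally spaced axes, using the standard formula 0.5 · sin(2π/n) · Σ r_i · r_{i+1}. The function name `radar_area` and the sample scores are hypothetical; the caption does not specify the paper's exact computation, and the result depends on the order in which the metrics are placed around the chart.

```python
import math

def radar_area(scores):
    """Area of a radar-chart polygon with equally spaced axes.

    scores: metric values in the order they appear around the chart.
    Adjacent axes are separated by an angle of 2*pi/n, so each pair of
    neighbouring scores spans a triangle of area 0.5 * r_i * r_j * sin(2*pi/n).
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(
        scores[i] * scores[(i + 1) % n] for i in range(n)
    )

# Hypothetical scores on seven metrics, each in [0, 1].
example_scores = [0.8, 0.6, 0.7, 0.5, 0.9, 0.4, 0.6]
print(round(radar_area(example_scores), 4))
```

Because the area multiplies neighbouring scores, a model that is weak on one metric loses area on two adjacent triangles, which is one reason a larger polygon indicates more balanced ability.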
