How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games
Yutong Xie, Yiyao Liu, Zhuang Ma, Lin Shi, Xiyuan Wang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei
TL;DR
This study benchmarks five prominent LLM-based chatbot families across six behavioral economics games to characterize their decision-making and behavioral patterns, extending prior work on the behavioral Turing test. The authors collect 50 independent responses per model per game, compare them with human response distributions, and quantify distributional similarity, payoff preferences, and consistency using techniques such as Wasserstein distance and multinomial logistic modeling. Key findings show that the chatbots converge on specific human-like behavior modes and can pass Turing tests with notable success, but they generally produce more concentrated distributions than humans, emphasize fairness, and display inconsistencies across games and models. The work highlights implications for deploying LLMs in high-stakes decision contexts and emphasizes the need for alignment objectives that account for human behavioral diversity and cross-model differences.
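To make the distributional comparison concrete, the following is a minimal sketch of a one-dimensional Wasserstein (earth mover's) distance between two sets of game responses. It is not the authors' code: the function name `wasserstein_1d`, the dictator-game framing, and all sample values are hypothetical and synthetic, chosen only to illustrate how a concentrated chatbot distribution can be compared with a more dispersed human one.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    support = sorted(set(u_sorted) | set(v_sorted))
    dist = 0.0
    # Both CDFs are step functions that are constant between support points.
    for left, right in zip(support, support[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        dist += abs(cdf_u - cdf_v) * (right - left)
    return dist


# Synthetic, made-up data (NOT from the paper): amounts given out of 100
# in a hypothetical dictator game.
chatbot_offers = [50] * 45 + [40] * 5                      # 50 responses, concentrated
human_offers = [0] * 10 + [20] * 10 + [30] * 10 + [50] * 15 + [70] * 5  # more dispersed

print(wasserstein_1d(chatbot_offers, human_offers))
```

A smaller value indicates that the chatbot's response distribution lies closer to the human one; identical samples give a distance of zero.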
Abstract
The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.
