How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games

Yutong Xie, Yiyao Liu, Zhuang Ma, Lin Shi, Xiyuan Wang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei

TL;DR

This study benchmarks five prominent LLM-based chatbot families across six behavioral economics games to characterize their decision-making and behavioral patterns, extending prior work on the behavioral Turing test. By collecting 50 independent responses per model per game and comparing to human distributions, the authors quantify distributional similarity, payoff preferences, and consistency using techniques such as Wasserstein distance and logistic multinomial modeling. Key findings show that while AI chatbots exhibit convergence to specific human-like behavior modes and can pass Turing tests with notable success, they generally produce more concentrated distributions, emphasize fairness, and display inconsistencies across games and models. The work highlights implications for deploying LLMs in high-stakes decision contexts and emphasizes the need for alignment objectives that account for human behavioral diversity and cross-model differences.
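As a rough illustration of the distributional comparison mentioned above, the following sketch computes the one-dimensional Wasserstein-1 distance between two empirical samples by integrating the absolute difference of their empirical CDFs. The function name `wasserstein_1d` and the sample values are hypothetical; this is not the authors' code, and the paper may use a different implementation or normalization.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equally weighted 1-D samples."""
    xs, ys = sorted(xs), sorted(ys)
    # The empirical CDFs only change at observed values, so it is enough
    # to integrate |F_x - F_y| over the intervals between those values.
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Made-up data: a spread-out "human-like" sample versus a concentrated one.
spread = [0, 50, 100]
concentrated = [50, 50, 50]
distance = wasserstein_1d(spread, concentrated)
```

A sample concentrated on a single value sits at a positive distance from a spread-out sample even when both share the same mean, which is the kind of difference the TL;DR describes between chatbot and human distributions.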

Abstract

The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.

Paper Structure

This paper contains 21 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Distributions of AI chatbot behaviors in economics games.
  • Figure 2: The Turing test results.
  • Figure 3: Pairwise behavior distribution dissimilarities estimated with Wasserstein distance.
  • Figure 4: The mean squared error (MSE) of the actual play distribution relative to the best-response utility, when matched with a partner playing the human distribution. The errors are calculated for each possible preference $b$ in the objective function (Eq. \ref{eq:utility}), and the average across all games is plotted. In particular, $b = 1$ corresponds to purely selfish preferences, $b = 0$ represents purely selfless preferences, and $b = 0.5$ reflects a preference for maximizing the combined payoff of both players. $r$ is a specification parameter, evaluated at $r=1$ and $r=1/2$.
  • Figure 5: Pairwise behavior distribution dissimilarities estimated with Wasserstein distance.
  • ...and 3 more figures
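
The objective function referenced in the Figure 4 caption is not reproduced in this excerpt. The sketch below assumes a CES-style form, U = (b * own^r + (1 - b) * partner^r)^(1/r), which is consistent with the caption's description of $b$ and $r$ but is an assumption, not the paper's confirmed equation. The names `ces_utility` and `best_keep`, and the 100-unit dictator-style split, are hypothetical.

```python
def ces_utility(own, partner, b, r):
    """Assumed CES-style utility: b weights own payoff, 1 - b the partner's."""
    return (b * own**r + (1 - b) * partner**r) ** (1 / r)


def best_keep(b, r, total=100):
    """Amount to keep, out of `total`, that maximizes the assumed utility."""
    return max(range(total + 1),
               key=lambda keep: ces_utility(keep, total - keep, b, r))


selfish = best_keep(b=1.0, r=1.0)    # keeps everything
selfless = best_keep(b=0.0, r=1.0)   # gives everything away
balanced = best_keep(b=0.5, r=0.5)   # equal split under r = 1/2
```

Under this assumed form, $r = 1$ with $b = 0.5$ makes every split equally good, because only the combined payoff matters, whereas $r = 1/2$ introduces curvature that favors an even split.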