Economics Arena for Large Language Models
Shangmin Guo, Haoran Bu, Haochuan Wang, Yi Ren, Dianbo Sui, Yuming Shang, Siting Lu
TL;DR
This work introduces EconArena, a dynamic, multi-agent evaluation framework for large language models using numeral-based competitive games (beauty contests and second-price auctions) to assess rationality, strategic reasoning, and instruction-following. It demonstrates that LLMs often deviate from Nash equilibria in one-shot play but can converge toward NE and win more often when history is provided, with models like GPT-4 showing faster adaptation. The study provides a metric suite and an open simulation package for evaluating LLMs under dynamic environments, enabling cross-model comparisons and insights into how instruction complexity and history affect performance. The findings highlight variability in rationality and instruction-following across models, suggesting directions for improving LLM-based agents in competitive settings.
Abstract
Large language models (LLMs) have been used extensively as the backbones of general-purpose agents, and some of the economics literature suggests that LLMs are capable of playing various types of economic games. Following these works, and to overcome the limitations of evaluating LLMs on static benchmarks, we propose competitive games as an evaluation for LLMs, which introduces multiple players and makes the environment dynamic. By varying the game history revealed to LLM-based players, we find that most LLMs are rational in that they play strategies that can increase their payoffs, but not as rational as Nash equilibria (NEs) would indicate. Moreover, when game history is available, certain LLMs, such as GPT-4, converge faster to the NE strategies, which suggests a higher level of rationality than other models. Meanwhile, certain LLMs win more often when game history is available, and we argue that the winning rate reflects the ability to reason about the strategies of other players. Throughout our experiments, we observe that the ability to strictly follow game rules described in natural language also varies among the LLMs we tested. In this work, we provide an economics arena for the LLM research community as a dynamic simulation to test the above-mentioned abilities of LLMs, i.e., rationality, strategic reasoning ability, and instruction-following capability.
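To make the two game types named in the TL;DR concrete, the following is a minimal Python sketch of how one round of a beauty contest and of a second-price auction could be scored, along with a simple distance-from-NE measure. It is an illustration under assumptions, not the EconArena package itself: the function names, the 2/3 ratio, and the tie-handling are hypothetical choices, and the paper's actual game parameters and metrics may differ. It relies on the textbook result that, for a ratio below 1, the beauty contest NE is for every player to choose 0.

```python
from statistics import mean


def beauty_contest_winners(guesses, ratio=2 / 3):
    """Score one beauty-contest round.

    Each player submits a number; the target is ratio * mean(guesses),
    and the closest guess wins. The ratio of 2/3 is an assumed default.
    Returns (target, list of winning player indices; ties are all kept).
    """
    target = ratio * mean(guesses)
    best = min(abs(g - target) for g in guesses)
    winners = [i for i, g in enumerate(guesses) if abs(g - target) == best]
    return target, winners


def second_price_auction(bids):
    """Score one sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest bid.
    Ties go to the earliest bidder (an assumed tie-breaking rule).
    Returns (winner index, price paid).
    """
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return order[0], bids[order[1]]


def deviation_from_ne(actions, ne_action=0.0):
    """Mean absolute distance of the chosen actions from an NE action.

    With ne_action=0.0 this measures how far beauty-contest guesses
    are from the all-zero equilibrium.
    """
    return mean(abs(a - ne_action) for a in actions)


if __name__ == "__main__":
    target, winners = beauty_contest_winners([30, 60, 90])
    print("target:", target, "winners:", winners)
    print("auction:", second_price_auction([10, 25, 18]))
    print("deviation:", deviation_from_ne([30, 60, 90]))
```

Tracking `deviation_from_ne` across repeated rounds, with and without the history shown to the players, is one simple way to express the convergence behaviour the abstract describes.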
