Economics Arena for Large Language Models

Shangmin Guo, Haoran Bu, Haochuan Wang, Yi Ren, Dianbo Sui, Yuming Shang, Siting Lu

TL;DR

This work introduces EconArena, a dynamic, multi-agent evaluation framework for large language models using numeral-based competitive games (beauty contests and second-price auctions) to assess rationality, strategic reasoning, and instruction-following. It demonstrates that LLMs often deviate from Nash equilibria in one-shot play but can converge toward NE and win more often when history is provided, with models like GPT-4 showing faster adaptation. The study provides a metric suite and an open simulation package for evaluating LLMs under dynamic environments, enabling cross-model comparisons and insights into how instruction complexity and history affect performance. The findings highlight variability in rationality and instruction-following across models, suggesting directions for improving LLM-based agents in competitive settings.

Abstract

Large language models (LLMs) have been extensively used as the backbones of general-purpose agents, and some of the economics literature suggests that LLMs are capable of playing various types of economics games. Following these works, and to overcome the limitations of evaluating LLMs with static benchmarks, we propose competitive games as an evaluation for LLMs, which introduces multiple players and makes the environment dynamic. By varying the game history revealed to LLM-based players, we find that most LLMs are rational in that they play strategies that can increase their payoffs, but not as rational as Nash Equilibria (NEs) would indicate. Moreover, when game history is available, certain LLMs, such as GPT-4, converge faster to the NE strategies, which suggests a higher level of rationality than other models. Meanwhile, certain LLMs win more often when game history is available, and we argue that the winning rate reflects the ability to reason about the strategies of other players. Throughout our experiments, we observe that the ability to strictly follow game rules described in natural language also varies among the LLMs we tested. In this work, we provide an economics arena for the LLM research community as a dynamic simulation to test the above-mentioned abilities of LLMs, i.e., rationality, strategic reasoning ability, and instruction-following capability.
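
For readers unfamiliar with the two game families named in the TL;DR, the following is a minimal sketch of their textbook forms, not the EconArena package or its API. It assumes the standard p-beauty contest (the winner is whoever is closest to p times the mean choice, with p = 2/3 chosen here for illustration) and the standard sealed-bid second-price auction. The function names, the value of p, and the absolute-difference deviation measure are all hypothetical; the paper's own rules and metrics may differ.

```python
from statistics import mean


def beauty_contest_winners(choices, p=2 / 3):
    """Textbook p-beauty contest (assumed rules, not the paper's exact setup).

    Each player submits a number; the target is p times the mean of all
    choices, and the players closest to the target win.
    Returns (target, list of winning player indices).
    """
    target = p * mean(choices)
    best = min(abs(c - target) for c in choices)
    winners = [i for i, c in enumerate(choices) if abs(c - target) == best]
    return target, winners


def second_price_auction(bids):
    """Textbook sealed-bid second-price auction (assumed rules).

    The highest bidder wins and pays the second-highest bid.
    Ties go to the earliest bidder because Python's sort is stable.
    Returns (winner index, price paid).
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return order[0], bids[order[1]]


def deviation_from_ne(choice, ne_choice):
    """Hypothetical deviation measure: absolute gap to the NE action."""
    return abs(choice - ne_choice)


if __name__ == "__main__":
    target, winners = beauty_contest_winners([0, 30, 60])
    print("beauty contest target:", target, "winners:", winners)
    print("auction (winner, price):", second_price_auction([5, 9, 7]))
    # With p < 1 the unique NE of the textbook beauty contest is for
    # everyone to choose 0, so a choice of 30 deviates by 30.
    print("deviation:", deviation_from_ne(30, 0))
```

In the textbook second-price auction, bidding one's true value is a weakly dominant strategy, so a deviation measure of this kind can compare a bid against the bidder's private value.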

Paper Structure (28 sections, 4 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Diagram of EconArena, which consists of three major modules: i) hosts; ii) agents; and iii) games. The hosts are responsible for running the games, interacting with the LLMs through APIs, and collecting and returning game results. The agents are wrappers around the APIs of various LLMs, and the games describe the rules of various economics games.
  • Figure 2: The performance of LLMs in beauty contest games playing against different types of opponents. By "Melee environment", we refer to the case where the LLMs are playing against each other, while in the "Rational environment", the LLMs are playing against 4 hard-coded rational agents. Note that some results for ChatGLM2 and Llama2 are not recorded because they failed to complete the games.
  • Figure 3: The performance of LLMs in second price auction games playing against different types of opponents. By "Melee environment", we refer to the case where LLMs are playing against each other, while in the "Rational environment", the LLMs are playing against 4 hard-coded rational agents. Note that the "Rational environment" panel of this figure is a violin plot, like the "Rational environment" panel of Figure 2.
  • Figure 4: Average payoffs ($\uparrow$) of LLMs when varying the game configuration. In the beauty contest setup, we change the upper bound of the interval from which an agent chooses a number, while in the second price auctions, we vary the private value signals ($s$) and asset level ($A$).
  • Figure 5: Deviation distance ($\downarrow$) from NEs in the "Rational environment" with history. In these experiments, we reveal a maximum of 3 runs of history to the LLMs.
  • ...and 4 more figures