STEER: Assessing the Economic Rationality of Large Language Models
Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
TL;DR
STEER introduces a principled benchmark for assessing the economic rationality of large language models, linking decision-making behavior to the von Neumann–Morgenstern utility axioms through a 64-element taxonomy. It builds a scalable benchmark based on multiple-choice question answering (MCQA) with graded domains and difficulty levels, generates thousands of questions (e.g., 24,500 per element for 49 elements), and has human validators check them. Results are reported via customizable STEER report cards that include accuracy and calibration metrics as well as domain and dependency robustness. Empirical results across 14 LLMs show that model size correlates with performance, that GPT-4 Turbo consistently outperforms the other models, and that self-explanation and limited few-shot prompting help in many settings. The framework, open data, and web interface let researchers probe rationality in economic contexts, identify weaknesses, and guide future improvements in LLM decision-making and mechanism design applications.
Abstract
There is increasing interest in using LLMs as decision-making "agents." Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
