STEER: Assessing the Economic Rationality of Large Language Models

Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz

TL;DR

STEER introduces a principled benchmark for assessing the economic rationality of large language models by linking decision-making behavior to von Neumann–Morgenstern utility axioms through a 64-element taxonomy. It builds a scalable MCQA-based benchmark with graded domains and difficulty, generating thousands of questions (e.g., 24,500 per element for 49 elements) and validating them with human validators, then reports results via customizable STEER report cards that include accuracy and calibration metrics plus domain and dependency robustness. Empirical results across 14 LLMs show model size correlates with performance, with GPT-4 Turbo consistently outperforming others and self-explanation plus limited few-shot prompting providing benefits in many settings. The framework, open data, and web interface enable researchers to probe rationality in economic contexts, identify weaknesses, and guide future improvements in LLM decision-making and mechanism design applications.

Abstract

There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.

Paper Structure (71 sections, 24 figures, 3 tables)

Figures (24)

  • Figure 1: High-level diagram of the taxonomy of elements of rationality. At the top level, we divide the space of decision making into settings: Foundations, Decisions in Single-Agent Environments, Decisions in Multi-Agent Environments, Decisions on Behalf of Other Agents; we further subdivide settings into modules (e.g., Cognitive Biases in Deterministic Environments) that capture conceptually similar behaviors.
  • Figure 2: Example generation of a question in Avoidance of Sunk Cost Fallacy. The generation has two parts: (1) a user prompt containing a template question and instructions to follow the formatting and style of the template; and (2) a static system prompt. The template question in this example is in the domain "created a project" and at grade level 5.
  • Figure 3: Example domain categorization of questions within elements of rationality and an example of two questions in different tests for the element Maximize Expected Utility. Top: We instantiate questions into as many domains as make sense for the element of rationality. This figure depicts the domain span of questions for four different elements. Bottom: Two questions in two domains (job offers and medical devices) and two grade levels. Here, a higher grade level means more outcomes in the options. On the right (Grade Level 4), we see two options each with two outcomes, whereas the one on the left (Grade Level 3) has one option with two outcomes and the other with one.
  • Figure 4: Example SRC for GPT-3.5 Turbo. On the left are the components necessary for measuring performance: a grade range, a set of domains, a performance metric and a model. The rest of the report card summarizes performance over the entire dataset, grouped by settings and modules in which there are questions in the grade range. In this example, GPT-3.5 Turbo was evaluated on all domains in grades 2–8 and given a single example as part of the prompt of a task. The right-most pane drills into Decisions in Multi-Agent Environments; crossed-out text illustrates tasks omitted due to our grade level filter.
  • Figure 5: Dependency subgraph for iterated removal of dominated strategies in two-agent games with two actions for each agent. This node requires the ability to interpret the format of a game matrix (both normal and bimatrix form), correctly answer first-order false belief tasks (have knowledge of others' beliefs), best respond (choose the best action given a fixed action for the opponent), and find dominated strategies. Finding dominated strategies in turn requires the ability to order payoffs, which is tested via transitivity and ignoring irrelevant alternatives. The remaining nodes can similarly be broken down.
  • ...and 19 more figures
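
To make the element in Figure 5 concrete, the following is a minimal Python sketch of iterated removal of dominated strategies in a two-agent game. It is an illustration only, not the benchmark's code: the function names (`strictly_dominated`, `iterated_removal`) and the payoff-matrix layout are hypothetical, and the sketch assumes strict dominance by pure strategies.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def strictly_dominated(payoffs: Matrix, own: List[int], other: List[int]) -> List[int]:
    """Return actions in `own` that are strictly dominated by another action in `own`.

    payoffs[a][b] is this agent's payoff when it plays a and the opponent plays b.
    Only opponent actions still in `other` are considered.
    """
    dominated = []
    for a in own:
        for alt in own:
            if alt != a and all(payoffs[alt][b] > payoffs[a][b] for b in other):
                dominated.append(a)
                break
    return dominated


def iterated_removal(row_payoffs: Matrix, col_payoffs: Matrix) -> Tuple[List[int], List[int]]:
    """Iteratively remove strictly dominated pure strategies for both agents.

    Both matrices are indexed [row_action][col_action] (assumed layout).
    Returns the surviving row actions and column actions.
    """
    n_rows, n_cols = len(row_payoffs), len(row_payoffs[0])
    rows, cols = list(range(n_rows)), list(range(n_cols))
    # Re-index the column agent's payoffs as [col_action][row_action].
    col_view = [[col_payoffs[r][c] for r in range(n_rows)] for c in range(n_cols)]

    changed = True
    while changed:
        changed = False
        dom_rows = strictly_dominated(row_payoffs, rows, cols)
        if dom_rows:
            rows = [r for r in rows if r not in dom_rows]
            changed = True
        dom_cols = strictly_dominated(col_view, cols, rows)
        if dom_cols:
            cols = [c for c in cols if c not in dom_cols]
            changed = True
    return rows, cols


# Made-up 2x2 game: the row agent has no dominated action at first, but once
# the column agent's dominated action 1 is removed, row action 1 becomes dominated.
row_payoffs = [[2, 0], [1, 1]]
col_payoffs = [[1, 0], [2, 1]]
surviving = iterated_removal(row_payoffs, col_payoffs)
print(surviving)
```

The example game needs two rounds of removal, which mirrors the dependency structure in the caption: the row agent must first recognize what the column agent will not play (a belief about the opponent) before its own dominated action can be identified.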