STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown

TL;DR

STEER-ME addresses the need for a comprehensive, non-strategic microeconomic reasoning benchmark for large language models by extending the prior STEER framework to supply-and-demand contexts. It introduces a taxonomy of $58$ non-strategic elements across five settings and employs an automated auto-STEER data-generation pipeline to produce diversified, domain- and perspective-rich MCQA questions evaluated across $27$ models and multiple prompting-adaptation regimes. The study reveals substantial variation in model performance, identifies robust models like o1-preview, and highlights systematic error patterns such as miscomputing deadweight loss and overreliance on provided options. By releasing the benchmark, data-generation tools, and evaluation framework, STEER-ME provides a practical resource for ongoing assessment and targeted fine-tuning of economic reasoning in LLMs with potential impact on economic analysis, policy simulations, and market modeling with AI.

Abstract

How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examined each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.
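To make the size of this combinatorial space concrete, the following is a minimal sketch that enumerates every (element, domain, perspective, type) combination. It assumes every element is grounded in the full $10$ domains, $5$ perspectives, and $3$ types, so the count is only an upper bound (the abstract says "up to"). The function name `template_grid` and the use of integer indices in place of real element or domain names are illustrative assumptions, not part of auto-STEER.

```python
from itertools import product

# Counts taken from the abstract; each is an upper limit ("up to").
N_ELEMENTS = 58
N_DOMAINS = 10
N_PERSPECTIVES = 5
N_TYPES = 3


def template_grid(n_elements=N_ELEMENTS, n_domains=N_DOMAINS,
                  n_perspectives=N_PERSPECTIVES, n_types=N_TYPES):
    """Yield every (element, domain, perspective, type) index tuple.

    Hypothetical helper: integer indices stand in for the real names.
    """
    return product(range(n_elements), range(n_domains),
                   range(n_perspectives), range(n_types))


# Upper bound on distinct combinations if every element used the full grid.
upper_bound = sum(1 for _ in template_grid())
print(upper_bound)  # 58 * 10 * 5 * 3 = 8700
```

In practice the realized number of combinations would be smaller wherever an element is grounded in fewer domains, perspectives, or types.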

Paper Structure

This paper contains 78 sections, 72 equations, 26 figures, 7 tables.

Figures (26)

  • Figure 1: The web app user interface for template AI double-checking. This page instantiates and fills a set of questions using a generated or example seed and then generates a response using an OpenAI model. The page also reports the number of questions answered correctly as well as the responses from the model.
  • Figure 2: Two questions in the consumer surplus element with different domains and perspectives. The text in red marks the labeled fields that will be filled in at test time, and the text in blue marks the perspective. On top, a question is framed in the education domain from a third-person woman's perspective; on the bottom, the same question is written for the sports domain from a third-person man's perspective. Both were generated during the style-transfer step of the data generation process.
  • Figure 3: Heatmap of closed-source LLM performance, measured by normalized accuracy, on the $30$ elements we instantiated. The models, on the y-axis, are sorted by parameter size. The elements, on the x-axis, are grouped by setting.
  • Figure 4: Scatter plot of calibrated performance on the Exponents element versus downstream performance gap across models. The x-axis shows the gap calculated as the quotient between a model's accuracy on real-valued exponent‐based (Cobb–Douglas) tasks and its accuracy on the linear version of those tasks for various downstream elements. The y-axis represents the model's performance on Exponents normalized by dividing by its average accuracy on the benchmark. Each point corresponds to a specific (model, downstream element) pair, with colors distinguishing different models.
  • Figure 5: Heatmap of normalized accuracy of models on elements within the Decisions on Consumption in Non-Strategic Environments setting. Performance is reported for the shown adaptation without CAR, and models are sorted by parameter size (when available).
  • ...and 21 more figures
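
The two axes described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the choice to average per-element accuracies with equal weight, and the example accuracies are all hypothetical.

```python
import math


def downstream_gap(acc_exponent_version, acc_linear_version):
    """x-axis: accuracy on the real-valued exponent (Cobb-Douglas) version
    of a downstream element divided by accuracy on its linear version."""
    if acc_linear_version == 0:
        raise ValueError("linear-version accuracy must be non-zero")
    return acc_exponent_version / acc_linear_version


def normalized_exponents_score(acc_exponents, benchmark_accuracies):
    """y-axis: accuracy on the Exponents element divided by the model's
    average accuracy on the benchmark.

    Assumption: the average is an unweighted mean over per-element accuracies.
    """
    mean_accuracy = sum(benchmark_accuracies) / len(benchmark_accuracies)
    if mean_accuracy == 0:
        raise ValueError("average benchmark accuracy must be non-zero")
    return acc_exponents / mean_accuracy


# Hypothetical accuracies for one (model, downstream element) pair.
gap = downstream_gap(acc_exponent_version=0.45, acc_linear_version=0.90)
score = normalized_exponents_score(0.60, [0.50, 0.70, 0.90])
print(gap, score)
```

A gap below 1 means the model does worse on the exponent-based version than on the linear one; a normalized score below 1 means the model is weaker on Exponents than on the benchmark on average.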