STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Narun Raman; Taylor Lundy; Thiago Amin; Jesse Perla; Kevin Leyton-Brown

TL;DR

STEER-ME addresses the need for a comprehensive benchmark of non-strategic microeconomic reasoning in large language models by extending the prior STEER framework to supply-and-demand contexts. It introduces a taxonomy of $58$ non-strategic elements across five settings and uses the auto-STEER data-generation pipeline to produce multiple-choice question answering (MCQA) items that vary in domain and perspective. These questions are used to evaluate $27$ models under multiple prompting and adaptation regimes. The study reveals substantial variation in model performance, identifies robust models such as o1-preview, and highlights systematic error patterns such as miscomputing deadweight loss and over-relying on the provided answer options. By releasing the benchmark, data-generation tools, and evaluation framework, STEER-ME provides a practical resource for ongoing assessment and targeted fine-tuning of economic reasoning in LLMs, with potential impact on AI-assisted economic analysis, policy simulation, and market modeling.

Abstract

How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to overfit evaluation benchmarks; we thus hope that it will serve as a useful tool for both evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examine each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.
