EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents

Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski

TL;DR

EconEvals introduces a framework for evaluating LLMs in economic decision-making through three stylized benchmarks (procurement, scheduling, and pricing) that test an LLM's ability to learn from the environment in context and to optimize under uncertainty. Complementing the benchmarks, the litmus-test framework quantifies how an LLM trades off competing objectives via litmus, reliability, and competency scores, which enables analysis of LLMs' decision tendencies in addition to their capabilities. Across a broad set of frontier LLMs, more recently released models generally earn higher benchmark scores, and analyses of trajectory data and the models' notes show how models learn, reason about preferences, and adapt to non-stationary environments. The study demonstrates that combining environment-grounded benchmarks with behavior-focused litmus tests offers a robust, generalizable approach to evaluating AI agents in economic decision-making tasks, with implications for deployment and governance. Overall, EconEvals provides a foundation for quantitatively comparing LLM agents as they increasingly participate in real-world economic workflows, and it highlights that litmus scores should be interpreted alongside competency and reliability scores.

Abstract

We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thought, and (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision-making.

Paper Structure

This paper contains 202 sections, 13 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Illustration of how the LLM agent interacts with the benchmark environment. The LLM agent obtains information and takes actions via tool use (see \ref{subsec:benchmark-interaction-method} and \ref{subsec:llm-agent-architecture}). The environment performs computations based on the tools used and returns information (see \ref{subsec:economic_environment}).
  • Figure 2: Benchmark scores of Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, GPT-4.1, o4-mini, Gemini 2.5 Pro, GPT-5, and Gemini 3 Pro Preview on the three EconEvals benchmark environments---procurement, scheduling, and pricing---at the Hard difficulty level, by LLM release date. Each point represents the average score over 12 randomly generated instances. The highest possible score is 100. The dashed lines represent OLS linear regression fits. More recently released LLMs generally earn higher benchmark scores, indicating that the capabilities of LLMs in economic decision-making contexts are improving with time.
  • Figure 3: Benchmark scores of Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, GPT-4.1, o4-mini, Gemini 2.5 Pro, and Gemini 3 Pro Preview on the three EconEvals benchmark environments---procurement, scheduling, and pricing---at the Basic, Medium, and Hard difficulty levels. Each point represents the average score over 12 randomly generated instances; error bars represent standard errors of the mean across instances. The highest possible score is 100. More difficult instances generally lead to lower benchmark scores, indicating that our difficulty scaling technique is effective.
  • Figure 4: Pairwise comparisons between LLM scores on the three benchmark environments. Each off-diagonal entry shows the result of a paired Wilcoxon signed-rank test comparing two LLMs across 36 matched instances (12 instances for each of the three difficulty levels), with significance levels: ***: $p < 0.01$, **: $p < 0.05$, *: $p < 0.1$. Diagonal entries display the mean score on Hard difficulty, which also determines the row/column ordering in each subplot.
  • Figure 5: Consistency of pairwise LLM comparisons across difficulty levels, for each of the three benchmark environments. A cell labeled $X/Y$ denotes: $Y$ (out of the three difficulty levels) comparisons are statistically significant, and $X$ of these go in the majority direction. (For example, if model $A$ outperformed model $B$ at Basic and Medium, but $B$ outperformed $A$ at Hard, this would be labeled 2/3.) All cells are labeled 0/0, 1/1, 2/2, or 3/3, indicating that comparisons are consistent whenever they are statistically significant (at the $p < 0.05$ level).
  • ...and 12 more figures
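
The $X/Y$ cell labels described in the Figure 5 caption can be illustrated with a short sketch. This is only an illustration of the labeling rule as stated in the caption, not the authors' code: the function names `consistency_label` and `is_consistent`, and the encoding of each per-difficulty outcome as `"A"`, `"B"`, or `None` (not statistically significant), are assumptions made for this example.

```python
from typing import Optional, Sequence

# Hypothetical encoding (an assumption for this sketch): for each difficulty
# level, "A" means model A significantly outperformed model B, "B" means the
# reverse, and None means the comparison was not statistically significant.
Outcome = Optional[str]


def consistency_label(outcomes: Sequence[Outcome]) -> str:
    """Return the X/Y label for one pair of models.

    Y counts the statistically significant comparisons; X counts how many of
    those point in the majority direction.
    """
    for outcome in outcomes:
        if outcome not in ("A", "B", None):
            raise ValueError(f"unexpected outcome: {outcome!r}")
    significant = [o for o in outcomes if o is not None]
    y = len(significant)
    a_wins = significant.count("A")
    x = max(a_wins, y - a_wins)  # size of the majority direction
    return f"{x}/{y}"


def is_consistent(outcomes: Sequence[Outcome]) -> bool:
    """True when every significant comparison points the same way (X == Y)."""
    x, y = consistency_label(outcomes).split("/")
    return x == y


# The caption's example: A wins at Basic and Medium, B wins at Hard.
print(consistency_label(["A", "A", "B"]))  # 2/3
# No significant comparison at any difficulty level.
print(consistency_label([None, None, None]))  # 0/0
```

Under this encoding, the caption's observation that every cell is 0/0, 1/1, 2/2, or 3/3 corresponds to `is_consistent` returning `True` for every pair of models.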