EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski
TL;DR
EconEvals introduces a framework for evaluating LLMs in economic decision-making. Three stylized benchmarks (procurement, scheduling, and pricing) test an LLM's ability to learn from the environment in context and to optimize under uncertainty. Complementing these, the litmus-test framework quantifies how an LLM trades off competing objectives, reporting litmus, reliability, and competency scores, which enables analysis of LLMs' decision tendencies beyond capability. Across a broad set of frontier LLMs, results show performance gains over time, and analyses of trajectory data and the models' notes reveal how models learn, reason about preferences, and adapt to non-stationary environments. The study demonstrates that combining environment-grounded benchmarks with behavior-focused litmus tests offers a robust, generalizable approach to evaluating AI agents in economic decision-making tasks, with implications for deployment and governance. Overall, EconEvals provides a foundation for quantitatively comparing LLM agents as they increasingly participate in real-world economic workflows, and it highlights the importance of reading litmus scores alongside competency and reliability scores when interpreting model behavior.
Abstract
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics (procurement, scheduling, and pricing) that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thought, and (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision-making.
