
LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets

Aidan Vyas

TL;DR

The paper introduces LemonadeBench v0.5, a 30-day lemonade stand benchmark to evaluate LLMs on economic intuition, long-term planning, and decision-making under uncertainty using a perishable inventory setting. Demand is modeled as $Q(p,h) = (50 - 10p) \cdot m_h \cdot \epsilon_h$ with hourly multipliers and stochastic noise, and the theoretical optimum price is $p^* = \$2.69$, yielding substantial but finite profit. A six-dimensional efficiency framework (Purchasing, Expired, Excess, Stockout, Pricing, Scheduling) reveals that models achieve profitability through diverse, often locally optimized strategies rather than global optimization, with the best model reaching about 70% of the theoretical optimum and universal underpricing observed. The work discusses implications for AI economic competence, highlights biases and measurement challenges (notably stockout-dominated losses), and outlines a roadmap to scale the benchmark with decade-long horizons, multi-location setups, and multi-agent dynamics. Code and data are released to support replication and future benchmarking efforts.
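To make the demand model concrete, the following is a minimal sketch of $Q(p,h) = (50 - 10p) \cdot m_h \cdot \epsilon_h$ and of the price that maximizes expected per-cup margin times demand. It is an illustration, not the benchmark's implementation. The function names, the constant unit cost, and the Gaussian form of the noise are all assumptions; the summary does not state them. Under a constant unit cost $c$, maximizing $(p - c)(50 - 10p)$ gives $p^* = (5 + c)/2$, so a reported optimum of \$2.69 would correspond to $c = 0.38$ in this simplified setting, which ignores inventory expiry and stockouts.

```python
import random


def expected_demand(price, hour_multiplier=1.0):
    """Expected cups demanded in one hour: (50 - 10p) * m_h, floored at zero."""
    return max(0.0, (50.0 - 10.0 * price) * hour_multiplier)


def sampled_demand(price, hour_multiplier=1.0, noise_sd=0.1, rng=random):
    """One noisy draw of demand.

    Assumption: epsilon_h is drawn from a Gaussian centered at 1 and clipped
    at zero. The source only says the noise is stochastic.
    """
    epsilon = max(0.0, rng.gauss(1.0, noise_sd))
    return expected_demand(price, hour_multiplier) * epsilon


def hourly_profit(price, unit_cost, hour_multiplier=1.0):
    """Expected profit for one hour, assuming every demanded cup can be served."""
    return (price - unit_cost) * expected_demand(price, hour_multiplier)


def optimal_price(unit_cost):
    """Maximizer of (p - c)(50 - 10p), from 50 - 20p + 10c = 0."""
    return (5.0 + unit_cost) / 2.0


# Hypothetical unit cost chosen so that the formula reproduces the reported 2.69.
ASSUMED_UNIT_COST = 0.38
best_price = optimal_price(ASSUMED_UNIT_COST)
```

Because $m_h$ and $\epsilon_h$ multiply demand without depending on price, they scale expected profit but do not move the maximizer in this sketch. That is why a single price can be optimal across all hours when inventory constraints are ignored.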

Abstract

We introduce LemonadeBench v0.5, a minimal benchmark for evaluating economic intuition, long-term planning, and decision-making under uncertainty in large language models (LLMs) through a simulated lemonade stand business. Models must manage inventory with expiring goods, set prices, choose operating hours, and maximize profit over a 30-day period: tasks that any small business owner faces daily. All models demonstrate meaningful economic agency by achieving profitability, with performance scaling dramatically with model sophistication, from basic models earning minimal profits to frontier models capturing 70% of the theoretical optimum, a greater than 10x improvement. Yet our decomposition of business efficiency across six dimensions reveals a consistent pattern: models achieve local rather than global optimization, excelling in select areas while exhibiting surprising blind spots elsewhere.

Paper Structure (19 sections, 16 equations, 1 figure, 3 tables)