LemonadeBench: Evaluating the Economic Intuition of Large Language Models in Simple Markets

Aidan Vyas

TL;DR

The paper introduces LemonadeBench v0.5, a 30-day lemonade stand benchmark to evaluate LLMs on economic intuition, long-term planning, and decision-making under uncertainty using a perishable inventory setting. Demand is modeled as $Q(p,h) = (50 - 10p) \cdot m_h \cdot \epsilon_h$ with hourly multipliers and stochastic noise, and the theoretical optimum price is $p^* = \$2.69$, yielding substantial but finite profit. A six-dimensional efficiency framework (Purchasing, Expired, Excess, Stockout, Pricing, Scheduling) reveals that models achieve profitability through diverse, often locally optimized strategies rather than global optimization, with the best model reaching about 70% of the theoretical optimum and universal underpricing observed. The work discusses implications for AI economic competence, highlights biases and measurement challenges (notably stockout-dominated losses), and outlines a roadmap to scale the benchmark with decade-long horizons, multi-location setups, and multi-agent dynamics. Code and data are released to support replication and future benchmarking efforts.
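The demand function and optimum price quoted above can be related with a short sketch. It assumes a constant per-cup cost `c`, no inventory or capacity limits, and noise that does not depend on price; under those assumptions the hourly multiplier and noise terms scale profit without moving the maximizer, so maximizing $(p - c)(50 - 10p)$ gives $p^* = (5 + c)/2$. The function names, the zero floor on demand, and the cost value `0.38` are illustrative assumptions, not taken from the paper; `0.38` is simply the cost that this simplified model would need in order to reproduce $p^* = \$2.69$.

```python
def demand(price, multiplier=1.0, noise=1.0):
    """Q(p, h) = (50 - 10p) * m_h * eps_h.

    The floor at zero is an assumption added here so that prices
    above $5 yield no sales instead of negative demand.
    """
    return max(0.0, (50.0 - 10.0 * price) * multiplier * noise)


def optimal_price(unit_cost):
    """Maximizer of (p - c) * (50 - 10p), assuming constant unit cost c.

    Setting the derivative 50 - 20p + 10c to zero gives p = (5 + c) / 2.
    """
    return (5.0 + unit_cost) / 2.0


def hourly_profit(price, unit_cost, multiplier=1.0, noise=1.0):
    """Profit for one hour if every demanded cup can be served."""
    return (price - unit_cost) * demand(price, multiplier, noise)


if __name__ == "__main__":
    c = 0.38  # hypothetical per-cup cost, chosen to match p* = 2.69
    p_star = optimal_price(c)
    print(f"closed-form optimum: ${p_star:.2f}")
    print(f"profit at optimum:   {hourly_profit(p_star, c):.3f}")
    print(f"profit at $2.00:     {hourly_profit(2.00, c):.3f}")
```

In this simplified model, pricing below $p^*$ sells more cups but gives up margin on each one, which is the trade-off the underpricing finding refers to.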

Abstract

We introduce LemonadeBench v0.5, a minimal benchmark for evaluating economic intuition, long-term planning, and decision-making under uncertainty in large language models (LLMs) through a simulated lemonade stand business. Models must manage inventory with expiring goods, set prices, choose operating hours, and maximize profit over a 30-day period, tasks that any small business owner faces daily. All models demonstrate meaningful economic agency by achieving profitability, and performance scales dramatically with model sophistication: basic models earn minimal profits, while frontier models capture 70% of the theoretical optimum, a greater than 10x improvement. Yet our decomposition of business efficiency across six dimensions reveals a consistent pattern: models achieve local rather than global optimization, excelling in select areas while exhibiting surprising blind spots elsewhere.

Paper Structure (19 sections, 16 equations, 1 figure, 3 tables)

Introduction
Related Work
LemonadeBench v0.5
Game Mechanics
Demand Function
Inventory Management
Optimal Pricing
Model Scaffolding
Results
Overall Performance
Business Efficiency Analysis
Efficiency Metric Definitions
Model Performance Patterns
Implications for AI Economic Competence
Computational Requirements
...and 4 more sections

Figures (1)

Figure 1: Daily profit trajectories showing strategic evolution across models
