Cost-of-Pass: An Economic Framework for Evaluating Language Models

Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou

TL;DR

This work introduces cost-of-pass, an economically grounded metric that combines LM accuracy with inference cost to measure the monetary effort required to obtain correct solutions. It embeds this metric in production-frontier theory, defining an LM frontier cost-of-pass and a human-expert baseline to assess true economic value across tasks. Through extensive experiments across lightweight, large, and reasoning models on basic, knowledge-based, and complex quantitative tasks, the authors show distinct model-family effectiveness and that frontier costs have fallen notably over time, especially for complex tasks, driven mainly by model-level innovations rather than inference-time techniques. The framework and analyses offer a principled tool for model deployment decisions and for diagnosing the sources of progress in cost-efficiency across the LM ecosystem.

Abstract

The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert", using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
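The two definitions in the abstract can be illustrated with a minimal Python sketch. It assumes that attempts are independent, so the expected number of attempts until the first correct solution is one over the success rate, and cost-of-pass becomes the expected cost per attempt divided by the success rate. The function names, the model names, and all dollar figures below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected dollar cost of obtaining one correct solution.

    Assumption: attempts are independent, so the expected number of
    attempts until the first success is 1 / success_rate.
    """
    if success_rate <= 0.0:
        # A model that never succeeds can never produce a correct solution.
        return math.inf
    return cost_per_attempt / success_rate


def frontier_cost_of_pass(models: dict, human_expert_cost: float) -> float:
    """Minimum cost-of-pass over the available models or the human expert.

    `models` maps a (hypothetical) model name to a tuple of
    (cost_per_attempt_in_dollars, success_rate).
    """
    candidates = [cost_of_pass(cost, rate) for cost, rate in models.values()]
    return min(candidates + [human_expert_cost])


# Hypothetical numbers for a single problem, for illustration only.
models = {
    "lightweight": (0.002, 0.25),  # cheap per attempt, often wrong
    "large": (0.02, 0.5),
    "reasoning": (0.10, 1.0),      # expensive per attempt, always right
}

for name, (cost, rate) in models.items():
    print(f"{name}: ${cost_of_pass(cost, rate):.3f} per correct solution")
print(f"frontier: ${frontier_cost_of_pass(models, human_expert_cost=5.0):.3f}")
```

In this toy setting the lightweight model sets the frontier even though it fails three times out of four, because its per-attempt cost is low enough to absorb the retries. If no model ever succeeds, the frontier falls back to the human-expert cost.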

Paper Structure

This paper contains 24 sections, 16 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Highlights of the cost-of-pass framework and empirical analyses. Core concepts (left) set foundations for: (A) Comparing the Human Expert Baseline to the frontier achieved by the single most effective LM per task category. (B) Tracking the reduction in frontier cost-of-pass over time, indicating progress driven by new model releases (color-coded by family). (C) Quantifying the essential contribution of each model family (lightweight, i.e. less than $1 per million tokens; large; and reasoning) to the current cost-efficiency frontier, measured by the percentage of each family's contribution. (D) Assessing the economic benefit (relative cost reduction) achieved by applying common inference-time techniques over the baseline model frontier (which rarely results in meaningful gains).
  • Figure 2: The frontier dollar cost-of-pass (i.e. $V_{p \sim D}(\mathcal{M}_t)$) steadily decreases with new model releases, spanning models released between May 2024 and February 2025. Y-axes are normalized (divided by $V_{p \sim D}(\mathcal{M}_0)$, shown in percentage (%)).
  • Figure 3: The relative improvement (%) in frontier cost-of-pass attributable to each model family $g$, calculated under a counterfactual setting where $\mathcal{M}_g$ is removed. Higher values signify greater essentialness for maintaining the current frontier.
  • Figure 4: Bar plot showing the percentage of change in frontier cost-of-pass per model release (i.e. $\frac{G_{p \sim D}(\{m_t\}, \mathcal{M}_{t-1})}{V_{p \sim D}(\mathcal{M}_{t-1})}$).
  • Figure 5: The relative improvement (%) in frontier cost-of-pass under a counterfactual setting, removing a model $m_*$ from the model set $\mathcal{M}_T$. High values mean that the model is essential for maintaining the current frontier.
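The quantities plotted in Figures 3 to 5 can be sketched in Python under stated assumptions. The captions do not define $G$, so the sketch assumes the gain of adding models is the drop in frontier cost-of-pass they cause, i.e. $G(\text{new}, \mathcal{M}) = V(\mathcal{M}) - V(\mathcal{M} \cup \text{new})$. The per-release ratio follows the Figure 4 caption; normalizing the counterfactual gain by the current frontier is an assumption. For simplicity the sketch works on a single problem rather than an expectation over $p \sim D$, and all function names, model names, and numbers are hypothetical.

```python
def frontier(costs_of_pass: dict, human_expert_cost: float) -> float:
    """Frontier cost-of-pass V(M): cheapest option among models and the expert."""
    return min(list(costs_of_pass.values()) + [human_expert_cost])


def gain(new: dict, existing: dict, human_expert_cost: float) -> float:
    """Assumed definition: G(new, M) = V(M) - V(M union new)."""
    before = frontier(existing, human_expert_cost)
    after = frontier({**existing, **new}, human_expert_cost)
    return before - after


def relative_gain_of_release(new: dict, existing: dict, human: float) -> float:
    """Figure 4 ratio: G({m_t}, M_{t-1}) / V(M_{t-1})."""
    return gain(new, existing, human) / frontier(existing, human)


def counterfactual_improvement(all_models: dict, removed: set, human: float) -> float:
    """Gain lost if `removed` had never existed, relative to the current
    frontier V(M_T). The choice of normalizer is an assumption."""
    remaining = {k: v for k, v in all_models.items() if k not in removed}
    taken_out = {k: all_models[k] for k in removed}
    return gain(taken_out, remaining, human) / frontier(all_models, human)


# Hypothetical per-model cost-of-pass values in dollars, illustration only.
human = 10.0
released_so_far = {"model_a": 4.0, "model_b": 2.0}
new_release = {"model_c": 0.5}
all_models = {**released_so_far, **new_release}

print(relative_gain_of_release(new_release, released_so_far, human))
print(counterfactual_improvement(all_models, {"model_c"}, human))
print(counterfactual_improvement(all_models, {"model_a"}, human))
```

Here the new release cuts the frontier from $2.00 to $0.50, a 75% reduction. Removing it from the final set would make the frontier three times worse relative to its current value, while removing `model_a` changes nothing because it never sets the frontier.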