The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference

Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson

TL;DR

The paper measures the real-world efficiency of AI inference as the price of achieving a given level of benchmark performance. Using a large dataset of benchmark prices built from Artificial Analysis and Epoch AI data, the authors apply regression analyses to quantify price trends, distinguishing frontier models (open- and closed-weight) from the broader model set. They find that, for frontier models, the price of a given performance level falls by about $5\times$ to $10\times$ per year, with hardware-adjusted algorithmic progress of around $3\times$ per year estimated from open-weight models. Over the same period, the total price of running benchmarks has often stayed flat or risen, offsetting some of these gains. The study argues for transparent reporting of the price and computational resources used in evaluations, so that benchmarks better reflect practical impact and guide future benchmarking practice.

Abstract

Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
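
The decomposition described in the abstract can be illustrated with a small sketch. It assumes a log-linear regression of price on time and benchmark score, which may differ from the paper's exact specification. The function names, the synthetic data, and the hardware factor of 2 are hypothetical and chosen only so the arithmetic is easy to follow; none of them are the paper's estimates.

```python
import numpy as np

def annual_price_decline_factor(years, scores, prices):
    """Fit log(price) = a + b*year + c*score by ordinary least squares.

    Returns exp(-b): the factor by which price falls per year when the
    benchmark score is held fixed (assumed specification, not the paper's).
    """
    years = np.asarray(years, dtype=float)
    scores = np.asarray(scores, dtype=float)
    log_prices = np.log(np.asarray(prices, dtype=float))
    design = np.column_stack([np.ones_like(years), years, scores])
    coef, *_ = np.linalg.lstsq(design, log_prices, rcond=None)
    return float(np.exp(-coef[1]))

def algorithmic_factor(total_factor, hardware_factor):
    """Divide the total annual price decline by the hardware price decline."""
    return total_factor / hardware_factor

# Synthetic, noise-free data constructed so that price falls 6x per year
# at a fixed score. These are illustrative values, not real measurements.
rng = np.random.default_rng(0)
years = rng.uniform(0.0, 2.0, size=50)
scores = rng.uniform(0.3, 0.8, size=50)
prices = 10.0 * 6.0 ** (-years) * np.exp(4.0 * scores)

total = annual_price_decline_factor(years, scores, prices)
# Hypothetical hardware price decline of 2x per year.
algorithmic = algorithmic_factor(total, hardware_factor=2.0)
print(f"total: {total:.2f}x per year, algorithmic: {algorithmic:.2f}x per year")
```

Working in log space makes the slope on time a constant multiplicative change per year, which is why the decline is reported as a factor such as $5\times$ rather than as a dollar amount.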

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Annual factor change in price, controlling for performance. We restrict our analysis to 2024-2025 models on the accuracy-price Pareto frontier, and we present separate analyses for all models and for open-weight models only. Note that we do not have enough open-source data on SWE-V to report it (see \ref{swe-v-caveat}). See Table \ref{tab:regression_results} and Table \ref{tab:regression_results_adjusted} for more information.
  • Figure 2: Graph of benchmark price vs. time for all models within a fixed GPQA-Diamond range. We don't have a good fit for models in the $40\%$ to $60\%$ range, but include it here for consistency. We suspect that the large drop in overall price in this range is due to increased market competition.
  • Figure 3: Graph of benchmark price vs. time for open-weight models within a fixed GPQA-Diamond range. The price for higher-quality models is decreasing faster than for lower-quality models.
  • Figure 4: Price to run the GPQA-Diamond benchmark. Prices are based on Epoch AI benchmark data and Artificial Analysis prices. Overall, benchmark prices in our dataset have increased despite a dramatic fall in the price of a given level of performance.
  • Figure 5: Price to run SWE-bench Verified. Prices are based on Epoch AI benchmark data and Artificial Analysis prices. As in Figure 4, benchmark prices have increased. In addition, the price to run SWE-bench Verified for some models is now in the thousands of dollars.
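
The Figure 1 caption restricts the analysis to models on the accuracy-price Pareto frontier. A minimal sketch of one way to select such models follows. It assumes a model is on the frontier when no other model is at least as cheap and at least as accurate; the paper's exact definition (for example, whether the frontier is evaluated at each release date) is not stated here. The `Model` type and the example values are hypothetical.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    price: float     # price to run the benchmark (hypothetical units)
    accuracy: float  # benchmark accuracy in [0, 1]

def pareto_frontier(models):
    """Return the models not dominated on (lower price, higher accuracy)."""
    # Sort by price ascending; at equal price, put the most accurate first.
    ordered = sorted(models, key=lambda m: (m.price, -m.accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for m in ordered:
        # Keep a model only if it beats every cheaper (or equally cheap) one.
        if m.accuracy > best_accuracy:
            frontier.append(m)
            best_accuracy = m.accuracy
    return frontier

# Illustrative values only, not data from the paper.
models = [
    Model("a", 1.0, 0.40),
    Model("b", 2.0, 0.35),   # dominated by "a": pricier and less accurate
    Model("c", 3.0, 0.55),
    Model("d", 3.0, 0.50),   # dominated by "c": same price, less accurate
    Model("e", 10.0, 0.70),
]
frontier_names = [m.name for m in pareto_frontier(models)]
print(frontier_names)
```

Sorting first means each model only has to be compared with the best accuracy seen so far, so the selection runs in a single pass after the sort.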