On the Origin of Algorithmic Progress in AI

Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld, Anmol Sandhu, Neil Thompson

TL;DR

This paper investigates the origins of algorithmic progress in AI by combining ablation experiments, scaling experiments, and theoretical analysis. It uses the Compute Equivalent Gains (CEG) framework to separate scale-invariant from scale-dependent contributions and to show how compute scale and the choice of reference point shape any measurement of progress. The key finding is that most gains measured at small scales are scale-invariant and contribute modestly, while two major scale-dependent transitions (the LSTM-to-Transformer shift and the Kaplan-to-Chinchilla data/parameter rebalancing) account for the bulk of frontier improvements. The results imply that algorithmic progress is highly scale- and reference-dependent: enormous gains appear only at very large compute, which raises concerns about inequality of access and makes AI development harder to forecast. Overall, the work reframes progress as a multi-dimensional, scale-sensitive phenomenon with significant implications for future research strategy and policy.

Abstract

Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling laws while finding little scaling difference for many other innovations. These experiments demonstrate that, contrary to standard assumptions, an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
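
To illustrate why a difference in scaling exponents makes efficiency gains depend on compute scale, here is a minimal sketch. It assumes each algorithm's loss follows a simple power law in training compute, L(C) = a * C^(-b), and defines the compute-equivalent gain at compute C as the factor by which the baseline would need to scale up its compute to match the improved algorithm's loss at C. The functional form, the function names, and all parameter values are illustrative assumptions, not fits or definitions taken from the paper.

```python
def loss(compute, a, b):
    """Hypothetical power-law loss curve: L(C) = a * C**(-b)."""
    return a * compute ** (-b)


def compute_equivalent_gain(compute, baseline, improved):
    """Factor by which the baseline must multiply its compute to reach the
    loss that the improved algorithm attains at `compute`."""
    a0, b0 = baseline
    a1, b1 = improved
    target_loss = loss(compute, a1, b1)
    # Invert the baseline curve: solve a0 * C0**(-b0) = target_loss for C0.
    baseline_compute = (a0 / target_loss) ** (1.0 / b0)
    return baseline_compute / compute


# Illustrative (a, b) pairs only; none of these values come from the paper.
BASELINE = (10.0, 0.05)
SAME_EXPONENT = (9.0, 0.05)       # lower prefactor, same exponent
STEEPER_EXPONENT = (10.0, 0.06)   # same prefactor, larger exponent

for compute in (1e15, 1e18, 1e21):
    invariant = compute_equivalent_gain(compute, BASELINE, SAME_EXPONENT)
    dependent = compute_equivalent_gain(compute, BASELINE, STEEPER_EXPONENT)
    print(f"C={compute:.0e}  same exponent: {invariant:.2f}x  "
          f"steeper exponent: {dependent:.0f}x")
```

Under these assumptions the gain works out to (a0/a1)^(1/b0) * C^(b1/b0 - 1). When the exponents match, the C term vanishes and the gain is the same at every scale; when the improved algorithm has the larger exponent, the gain grows as a power of compute, so the measured value depends on the scale at which it is evaluated.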

Paper Structure

This paper contains 49 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Compute Equivalent Gain multiplier for algorithms measured on a 3.6M parameter transformer model using ablation experiments. Many recent training advancements have a small impact on training efficiency. Hatched bars represent improvements we believe are scale-dependent.
  • Figure 2: Comparison of the total effect of all measured post-transformer algorithmic changes (left bar) versus the product of the individual ablation-estimated effects.
  • Figure 3: Comparison of the total effect of all measured changes, including transformer to (left bar) versus the product of the individual ablation-estimated effects.
  • Figure 4: Panel (a) depicts the scaling difference between a Modern Transformer (purple) and a standard LSTM (green). Panel (b) depicts the scaling difference between a Modern Transformer (purple) and a Retro Transformer (blue), in which all post-2017 innovations are ablated. LSTMs seem to have significantly different scaling exponents, while post-2017 transformer innovations have minimal effect on scaling. All graphs depict the training curves for models with hidden dimensions between 32 and 256, with all other hyperparameters scaled proportionately.
  • Figure 5: Analytical difference in performance between transformer models scaled with Kaplan versus Chinchilla recommendations. Interestingly, the efficiency gap first converges, then diverges.
  • ...and 4 more figures