Algorithmic progress in language models

Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla

TL;DR

This paper asks how much of the progress in language-model pre-training comes from algorithmic innovation versus scaling up compute and data. It builds a dataset of over 200 model evaluations on WikiText and Penn Treebank and fits an augmented scaling-law model expressed in terms of effective compute. The compute needed to reach a fixed level of performance halves about every 8–9 months. Even so, compute scaling accounts for most of the overall performance gains, with algorithmic progress contributing a smaller share. The transformer architecture yields a substantial compute-equivalent gain, estimated at around 7.2×. The work highlights the value and limits of current scaling laws for forecasting future progress and can inform how researchers allocate effort between compute and algorithmic research.

Abstract

We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore's Law. We estimate augmented scaling laws, which enable us to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, our analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period. Though limited by noisy benchmark data, our analysis quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms.
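
To make the headline figure concrete, the sketch below converts a halving time into an annual efficiency factor and compares it with a 2-year doubling time, used here as a stand-in for Moore's Law. The function name `annual_factor` is a hypothetical helper introduced for illustration; it is not code from the paper, and the calculation assumes a constant exponential rate.

```python
def annual_factor(halving_months: float) -> float:
    """Factor by which the compute needed for a fixed performance shrinks per year,
    assuming a constant halving time (an illustrative simplification)."""
    return 2.0 ** (12.0 / halving_months)

# Central estimate and 95% confidence-interval bounds quoted in the abstract (months).
estimates = {"central": 8.0, "CI lower": 5.0, "CI upper": 14.0}
factors = {label: annual_factor(months) for label, months in estimates.items()}

# A 24-month doubling time as a rough proxy for Moore's Law (assumption).
moores_law_factor = annual_factor(24.0)

for label, factor in factors.items():
    print(f"{label}: {factor:.2f}x per year")
print(f"24-month doubling: {moores_law_factor:.2f}x per year")
```

With an 8-month halving time the factor is 2^1.5, roughly 2.83× per year, against roughly 1.41× per year for a 24-month doubling time.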

Paper Structure

This paper contains 42 sections, 45 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: Estimates of effective compute doubling times from algorithmic improvements across different domains. Blue dots represent central estimates or ranges; blue triangles correspond to doubling times for problems at different sizes (ranging from 1K to 1B); the purple dashed line corresponds to the 2-year doubling time associated with Moore's law. The koch2022progress range spans estimates for integer and mixed-integer linear programming.
  • Figure 2: Log perplexity of over 231 language models analyzed in our work, spanning over 8 orders of magnitude of compute, with each shape representing a model. The size of the shape is proportional to the compute used during training. Comparable perplexity evaluations are curated from the existing literature and from our own evaluations.
  • Figure 4: Comparison of estimated doubling times for effective compute from algorithmic progress, before and after set cutoff years from 2016-2020. Shorter doubling times in the "post" period relative to "pre" indicate an acceleration in the rate of algorithmic progress after that cutoff year. Longer doubling times indicate a deceleration.
  • Figure 5: A stylized illustration of the relative contribution of compute scaling and algorithmic progress to effective compute. The physical compute contribution is estimated from the doubling times in sevillacompute, and the algorithmic progress contribution is based on the aggregated doubling time estimate from the top 10 models in cross validation (see the section on doubling times). We further plot the physical training compute values for several notable models (e.g. GPT-2) in their publication years.
  • Figure 7: Relative compute (relative to baseline model) used to train models that achieve the same evaluated perplexity as Megatron-LM, GPT-2, and Gopher respectively. Doubling times of effective compute are 14.9, 5.9, and 6.9 months using least squares regression for Megatron-LM (cross-entropy range 2.87-3.06), GPT-2 (cross-entropy range 2.79-2.93), and Gopher (cross-entropy range 1.87-2.32), respectively. Circles are proportional to the compute used during training.
  • ...and 10 more figures
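
The Figure 7 caption reports doubling times obtained by least squares regression. One way such an estimate could be formed is to regress log2 of relative compute on publication date and convert the slope into months per halving, as sketched below. The function name is hypothetical and the data points are synthetic, chosen only so the arithmetic is easy to check; neither comes from the paper.

```python
import math

def doubling_time_months(years, relative_compute):
    """Ordinary least squares fit of log2(relative compute) against time.

    Returns the months needed for required compute to halve, which is
    equivalent to effective compute doubling. Assumes compute falls over time.
    """
    ys = [math.log2(c) for c in relative_compute]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, ys))
    slope = sxy / sxx  # change in log2(compute) per year; negative when compute falls
    return -12.0 / slope

# Synthetic example: the compute needed for a fixed perplexity halves once per year.
years = [2019.0, 2020.0, 2021.0, 2022.0]
relative_compute = [1.0, 0.5, 0.25, 0.125]
estimate = doubling_time_months(years, relative_compute)
print(f"Estimated doubling time: {estimate:.1f} months")
```

On real benchmark data the points scatter around the fitted line, so the estimate depends on which models fall inside the chosen cross-entropy range.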