The rising costs of training frontier AI models

Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, Tamay Besiroglu, David Owen

TL;DR

The study addresses the steep, under-publicized rise in frontier AI training costs and proposes three complementary estimation methods (amortized hardware CapEx + energy, cloud rental prices, and full model-development costs including R&D labor) to quantify trends. Using a large frontier-model dataset and hardware-price history, it finds a consistent $\approx$2.4× per-year growth since 2016, with accelerator chips and staff costs as dominant drivers. The analysis reveals that hardware acquisition costs greatly exceed amortized costs and that R&D labor can account for up to about half of total development costs, implying rising barriers to entry and concentration of frontier AI capability. If the trend continues, the most expensive public frontier models could approach $1B per training run by 2027, raising significant implications for governance, competition, and equitable access to AI advancement.

Abstract

The costs of training frontier AI models have grown dramatically in recent years, but there is limited public data on the magnitude and growth of these expenses. This paper develops a detailed cost model to address this gap, estimating training costs using three approaches that account for hardware, energy, cloud rental, and staff expenses. The analysis reveals that the amortized cost to train the most compute-intensive models has grown precipitously at a rate of 2.4x per year since 2016 (90% CI: 2.0x to 2.9x). For key frontier models, such as GPT-4 and Gemini, the most significant expenses are AI accelerator chips and staff costs, each costing tens of millions of dollars. Other notable costs include server components (15-22%), cluster-level interconnect (9-13%), and energy consumption (2-6%). If the trend of growing development costs continues, the largest training runs will cost more than a billion dollars by 2027, meaning that only the most well-funded organizations will be able to finance frontier AI models.
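To make the headline growth rate concrete, the minimal Python sketch below converts a constant per-year growth factor into a doubling time and a cumulative multiplier. The function names are illustrative and do not come from the paper. The only inputs taken from the abstract are the 2.4x point estimate and the 90% CI bounds (2.0x and 2.9x). The sketch assumes growth stays exactly exponential, which is a simplification.

```python
import math

def doubling_time_months(growth_per_year: float) -> float:
    """Months for cost to double, assuming constant exponential growth."""
    return 12 * math.log(2) / math.log(growth_per_year)

def cumulative_growth(growth_per_year: float, years: float) -> float:
    """Total cost multiplier after `years` of constant growth."""
    return growth_per_year ** years

# Point estimate and 90% CI bounds quoted in the abstract.
for g in (2.0, 2.4, 2.9):
    print(f"{g}x/year: doubles every {doubling_time_months(g):.1f} months, "
          f"{cumulative_growth(g, 3):.1f}x over 3 years")
```

Under this assumption, 2.4x per year corresponds to a doubling time of roughly 9.5 months, and the CI bounds correspond to about 12 and 7.8 months respectively.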

Paper Structure (35 sections, 10 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Amortized hardware cost plus energy cost for the final training run of frontier models. The selected models are among the top 10 most compute-intensive for their time. Amortized hardware costs are the product of training chip-hours and a depreciated hardware cost, with 23% overhead added for cluster-level networking. Open circles indicate costs which used an estimated production cost of Google TPU hardware. These costs are generally more uncertain than the others, which used actual price data rather than estimates.
  • Figure 2: (Reproduction of Figure 1 for convenience.) Amortized hardware cost plus energy cost for the final training run of frontier models. The selected models are among the top 10 most compute-intensive for their time. Amortized hardware costs are the product of training chip-hours and a depreciated hardware cost, with 23% overhead added for cluster-level networking. Open circles indicate costs which used an estimated production cost of Google TPU hardware. These costs are generally more uncertain than the others, which used actual price data rather than estimates.
  • Figure 3: Estimated cloud compute costs for the final training run of frontier models. The selected models are among the top 10 most compute-intensive for their time. The costs are the product of the number of training chip-hours and a historical cloud rental price.
  • Figure 4: Estimated hardware acquisition costs to train frontier models. The selected models are among the top 10 most compute-intensive for their time. The costs are the product of the number of servers and the earliest available server price, with about 23% overhead added for cluster-level networking hardware.
  • Figure 5: The percentage of the amortized hardware CapEx + energy estimates made up by different hardware and energy costs. Note that the breakdown across models is approximate. Cluster-level interconnect is assumed to be a constant 19% fraction of the cluster CapEx, and the proportion of server components is based on only three comparisons between NVIDIA DGX server prices and single GPU prices (see the paper's appendix for details). The energy costs are more specific, varying with the number of training chip-hours and the hardware (see the paper's appendix).
  • ...and 5 more figures
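
The captions above describe each cost estimate as a simple product, which can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: all function and parameter names are hypothetical, and the caller must supply the chip-hours, prices, and server counts, since none are given here. The last helper shows one plausible reading of how the 23% networking overhead (Figures 1 and 4) relates to the 19% interconnect share of cluster CapEx (Figure 5).

```python
# 23% overhead for cluster-level networking, as stated in the
# Figure 1 and Figure 4 captions.
NETWORK_OVERHEAD = 0.23

def amortized_hardware_cost(chip_hours: float, depreciated_cost_per_chip_hour: float) -> float:
    """Figure 1 hardware term: chip-hours x depreciated hardware cost, plus overhead.
    Energy cost is a separate term and is not modeled here."""
    return chip_hours * depreciated_cost_per_chip_hour * (1 + NETWORK_OVERHEAD)

def cloud_cost(chip_hours: float, rental_price_per_chip_hour: float) -> float:
    """Figure 3: chip-hours x historical cloud rental price."""
    return chip_hours * rental_price_per_chip_hour

def acquisition_cost(num_servers: float, server_price: float) -> float:
    """Figure 4: number of servers x earliest available server price, plus overhead."""
    return num_servers * server_price * (1 + NETWORK_OVERHEAD)

def overhead_to_share(overhead: float) -> float:
    """Convert an overhead added on top of a base cost into a share of the total."""
    return overhead / (1 + overhead)

# Assumed interpretation: 23% added on top is about 19% of the total.
print(f"{overhead_to_share(NETWORK_OVERHEAD):.1%}")
```

An overhead of 23% on top of server cost works out to 0.23 / 1.23, or about 18.7% of the total, which rounds to the 19% quoted in the Figure 5 caption.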