Training Compute Thresholds: Features and Functions in AI Regulation

Lennart Heim, Leonie Koessler

TL;DR

The paper argues that training compute thresholds are the most suitable initial filter for GPAI regulation because they correlate with risk, are quantifiable, and can be assessed early and externally verified. It details the features that make compute a robust regulatory signal, acknowledges significant limitations, and explains how thresholds should function as first-step gates complemented by capability evaluations and risk assessments. The authors map these ideas to existing policies like the US AI EO and EU AI Act, discuss strategic challenges in threshold setting and updating, and explore alternative metrics such as risk estimates and effective compute for a more nuanced governance framework. The practical impact is a cautious, evolution-aware regulatory approach that uses compute thresholds to raise visibility and trigger deeper scrutiny without over-relying on a single proxy for risk.

Abstract

Regulators in the US and EU are using thresholds based on training compute--the number of computational operations used in training--to identify general-purpose artificial intelligence (GPAI) models that may pose risks of large-scale societal harm. We argue that training compute currently is the most suitable metric to identify GPAI models that deserve regulatory oversight and further scrutiny. Training compute correlates with model capabilities and risks, is quantifiable, can be measured early in the AI lifecycle, and can be verified by external actors, among other advantageous features. These features make compute thresholds considerably more suitable than other proposed metrics to serve as an initial filter to trigger additional regulatory requirements and scrutiny. However, training compute is an imperfect proxy for risk. As such, compute thresholds should not be used in isolation to determine appropriate mitigation measures. Instead, they should be used to detect potentially risky GPAI models that warrant regulatory oversight, such as through notification requirements, and further scrutiny, such as via model evaluations and risk assessments, the results of which may inform which mitigation measures are appropriate. In fact, this appears largely consistent with how compute thresholds are used today. As GPAI technology and market structures evolve, regulators should update compute thresholds and complement them with other metrics in regulatory review processes.
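
The "initial filter" role described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' or any regulator's procedure: the `Model` class and both function names are hypothetical, the threshold values are the ones quoted in the Figure 1 caption ($10^{26}$ operations for the US AI EO, $10^{25}$ for the EU AI Act), and "more than" is read as a strict inequality.

```python
from dataclasses import dataclass

# Threshold values as quoted in the Figure 1 caption (in training operations).
US_AI_EO_THRESHOLD = 1e26   # triggers reporting requirements
EU_AI_ACT_THRESHOLD = 1e25  # triggers a presumption of systemic risk


@dataclass
class Model:
    """Hypothetical record for a GPAI model (illustration only)."""
    name: str
    training_compute: float  # total operations used in training


def triggered_regimes(model: Model) -> list[str]:
    """Return the regimes whose compute threshold the model exceeds."""
    regimes = []
    if model.training_compute > EU_AI_ACT_THRESHOLD:
        regimes.append("EU AI Act")
    if model.training_compute > US_AI_EO_THRESHOLD:
        regimes.append("US AI EO")
    return regimes


def needs_further_scrutiny(model: Model) -> bool:
    """Initial filter only: True means 'look closer', not 'is risky'."""
    return bool(triggered_regimes(model))
```

The filter deliberately returns only a flag for further scrutiny. Choosing mitigation measures would depend on later steps such as model evaluations and risk assessments, which this sketch does not model.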

Paper Structure (18 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Training compute has been increasing at a fast rate, doubling roughly every 6 months ($4\times$ per year). The US AI EO introduces reporting requirements for models trained with more than $10^{26}$ operations. The EU AI Act presumes a GPAI model poses systemic risk and imposes a variety of requirements for models trained with more than $10^{25}$ operations.
  • Figure 2: Compute thresholds serve as an initial filter to identify GPAI models that warrant regulatory oversight and further scrutiny, and, for example, evaluation against capability thresholds to determine appropriate mitigation measures, complemented by other AI requirements.
  • Figure 3: Amount of compute used to train AI models over time. In the pre-deep learning era, training compute followed Moore's Law, doubling approximately every two years. Since the emergence of the Deep Learning Era around 2010, training compute has been increasing at a much faster rate, doubling roughly every 6 months (increasing by about $4\times$ per year). This rapid growth is largely driven by increased investments in computational resources for training larger models, which have demonstrated improved capabilities (figure from sastry2024; up-to-date as of end of 2023; underlying data and updates can be found at 2024).
  • Figure 4: We recommend only measuring pre-training compute and not including compute used in further enhancement processes (figure adapted from pistilloforthcoming).
  • Figure 5: Cost and compute used for training AI models. The amount of compute used to train a model directly corresponds to the amount of financial resources required to do so. This rapid growth is largely driven by increased investments in computational resources for training larger models, which have demonstrated improved capabilities (figure adapted from sastry2024; underlying data and updates can be found at 2024).
  • ...and 6 more figures
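
The growth figures in the captions above can be checked with a short Python sketch. It confirms that a 6-month doubling time corresponds to $4\times$ per year and estimates how long a fixed threshold stays ahead of a given compute level. The function names are hypothetical, and the extrapolation assumes the historical doubling time simply continues, which is an assumption of this sketch rather than a claim from the paper.

```python
import math

DOUBLING_TIME_MONTHS = 6.0  # doubling time quoted in the Figure 1 and Figure 3 captions


def annual_growth_factor(doubling_time_months: float) -> float:
    """Multiplicative growth per year implied by a doubling time in months."""
    return 2.0 ** (12.0 / doubling_time_months)


def years_to_reach(current_compute: float, threshold: float,
                   doubling_time_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Years until compute reaches `threshold`, assuming a constant doubling time."""
    if current_compute >= threshold:
        return 0.0
    doublings_needed = math.log2(threshold / current_compute)
    return doublings_needed * doubling_time_months / 12.0


print(annual_growth_factor(6.0))    # deep learning era: 6-month doubling
print(annual_growth_factor(24.0))   # Moore's Law pace: doubling every two years
print(years_to_reach(1e25, 1e26))   # time to close a 10x gap at a 6-month doubling time
```

Under these assumptions a tenfold gap, such as the one between the $10^{25}$ and $10^{26}$ thresholds, closes in well under two years, which is one way to see why the paper recommends that regulators update thresholds over time.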