On the Limitations of Compute Thresholds as a Governance Strategy

Sara Hooker

TL;DR

This paper examines whether compute thresholds are a viable governance tool for AI risk, arguing that both risk estimation and risk mitigation are uncertain and dynamic. It critiques FLOP-based thresholds, citing policy examples such as $10^{26}$ FLOP (US EO) and $10^{25}$ FLOP (EU AI Act), as inadequately grounded and often misaligned with how risk actually emerges, since post-training inference, data quality, optimization, and architecture all reshape risk. It reviews the limitations of scaling laws and the unpredictable mapping between compute and downstream capabilities, and advocates dynamic, multi-metric risk indices rather than static FLOP targets. The aim is to guide policy toward horizon-aware, modality-specific, and transparently justified governance that can adapt as compute and architectural innovations continue to alter the risk landscape.

Abstract

At face value, this essay is about understanding a fairly esoteric governance tool called compute thresholds. However, in order to grapple with whether these thresholds will achieve anything, we must first understand how they came to be. To do so, we need to engage with a decades-old debate at the heart of computer science progress, namely, is bigger always better? Does a certain inflection point of compute result in changes to the risk profile of a model? Hence, this essay may be of interest not only to policymakers and the wider public but also to computer scientists interested in understanding the role of compute in unlocking breakthroughs. This discussion is timely given the wide adoption of compute thresholds in both the White House Executive Orders on AI Safety (EO) and the EU AI Act to identify more risky systems. A key conclusion of this essay is that compute thresholds, as currently implemented, are shortsighted and likely to fail to mitigate risk. The relationship between compute and risk is highly uncertain and rapidly changing. Relying upon compute thresholds overestimates our ability to predict what abilities emerge at different scales. This essay ends with recommendations for a better way forward.
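To make concrete what a static compute threshold checks, the sketch below compares an estimated training FLOP count against the two thresholds cited in the TL;DR. It is only an illustration: the `6 * parameters * tokens` rule of thumb for dense transformer training compute is an assumption of this sketch rather than a method taken from the essay, the function names are hypothetical, and both example models are invented.

```python
# Thresholds as cited in the TL;DR (US EO and EU AI Act).
US_EO_THRESHOLD_FLOP = 1e26
EU_AI_ACT_THRESHOLD_FLOP = 1e25


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate for a dense transformer.

    Assumption of this sketch (not from the essay): about 6 FLOP per
    parameter per training token. It ignores post-training, inference-time
    compute, ensembling, and mixture-of-experts routing.
    """
    return 6.0 * n_params * n_tokens


def crossed_thresholds(training_flop: float) -> dict:
    """Report which static thresholds a training run would exceed.

    Thresholds are treated as strict "greater than" cutoffs, which is
    also an assumption of this sketch.
    """
    return {
        "us_eo": training_flop > US_EO_THRESHOLD_FLOP,
        "eu_ai_act": training_flop > EU_AI_ACT_THRESHOLD_FLOP,
    }


# Two hypothetical models, invented for illustration only.
small_flop = estimate_training_flop(n_params=70e9, n_tokens=15e12)
large_flop = estimate_training_flop(n_params=400e9, n_tokens=15e12)

print(f"small: {small_flop:.2e} FLOP -> {crossed_thresholds(small_flop)}")
print(f"large: {large_flop:.2e} FLOP -> {crossed_thresholds(large_flop)}")
```

The check reduces a model to a single scalar and a fixed cutoff. Nothing in it reflects data quality, optimization, architecture, or post-training gains, which is the gap the essay argues makes such thresholds a poor proxy for risk.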

Paper Structure (25 sections, 9 figures)

Understanding Risk
The Uncertain Relationship Between Compute and Risk
A shift in the relationship between compute and performance
Data quality reduces reliance on compute
Optimization breakthroughs compensate for compute
Architecture plays a significant role in determining scalability
Avoiding a FLOP FLOP
Challenges of using FLOP as a metric
Training FLOP doesn't account for post-training leaps in performance
Difficulty tracking FLOP across model lifecycle
How to handle Mixture of Experts (MoEs) and classic ensembling?
FLOP only accounts for a single model, but does not capture risk of the overall system
FLOP varies dramatically across different modalities
We are not very good at predicting the relationship between compute and risk
Limitations of scaling laws
...and 10 more sections

Figures (9)

Figure 1: Effective governance requires both 1) estimating the level and origins of risk to society (see Right) and 2) aligning on a proportionate response (see Left). History is replete with examples where one or both of these stages fail. This note applies this lens to assess the viability of policies aimed at mitigating the risks introduced by a new era of generative AI models. We ask 1) whether we have correctly estimated the role of compute in amplifying generative AI model risk, and 2) whether hard-coded compute thresholds are a meaningful tool for mitigating risk.
Figure 2: Byte Magazine Cover, Volume 2, 1977. A key characteristic of modern societies is our ability to choose among future alternatives by controlling for risk. One challenge is how to balance unknown future risks against the risks of harm present today. Compute thresholds as currently implemented are an example of precautionary policy: few models currently deployed in the wild meet the current criteria. This implies that the emphasis is not on auditing risks incurred by current models, but rather on the belief that future levels of compute will introduce new unforeseen risks.
Figure 3: The changing relationship between compute and performance. Smaller models are becoming increasingly performant and now routinely outperform much larger models. Right: The best daily model of 13B parameters or fewer submitted to the Open LLM leaderboard over time. Even among comparably small models, performance has been growing rapidly. Left: The best small models submitted to the Open LLM leaderboard easily outperform far larger models. Over time, more and more large models have been easily out-competed by small (<13B) models. In the left plot, points are sized by number of parameters to give a sense of the relative size of each submitted model.
Figure 4: Byte Magazine Cover, Volume 5, 1980. Compute is rarely the only determinant of progress. Data quality, instruction finetuning, preference training, retrieval augmented networks, tool use, chain-of-thought reasoning, and increased context length are all examples of optimization techniques that add little or no training FLOP but result in significant gains in performance.
Figure 5: Different modalities have very different compute requirements. Right: A plot of all models tracked in the Epoch AI database. While model size has grown overall, some domains, such as language, are far more prone to scaling. Left: Boxplot distributions for systems that Epoch AI classifies as notable over the same period (2010-24), showing pronounced differences between modalities. Language models have many training-compute outliers, whereas notable systems in vision, biology, and image generation tend to require far fewer training FLOP (Epoch AI, 2023).
...and 4 more figures
