On the Limitations of Compute Thresholds as a Governance Strategy
Sara Hooker
TL;DR
This paper examines whether compute thresholds are a viable governance tool for AI risk, arguing that both risk estimation and risk mitigation are uncertain and change over time. It critiques FLOP-based thresholds, citing policy examples such as $10^{26}$ FLOP (US Executive Order) and $10^{25}$ FLOP (EU AI Act) as inadequately grounded and often misaligned with how risk actually emerges, since post-training inference, data quality, optimization, and architecture all reshape risk. It reviews the limitations of scaling laws and the unpredictable mapping between compute and downstream capabilities, and advocates dynamic, multi-metric risk indices rather than static FLOP targets. The paper aims to guide policy toward horizon-aware, modality-specific, and transparently justified governance that can adapt as compute and architectural innovations continue to alter the risk landscape.
Abstract
At face value, this essay is about understanding a fairly esoteric governance tool called compute thresholds. However, in order to grapple with whether these thresholds will achieve anything, we must first understand how they came to be. To do so, we need to engage with a decades-old debate at the heart of computer science progress: is bigger always better? Does a certain inflection point of compute result in changes to the risk profile of a model? Hence, this essay may be of interest not only to policymakers and the wider public but also to computer scientists interested in understanding the role of compute in unlocking breakthroughs. This discussion is timely given the wide adoption of compute thresholds in both the White House Executive Order on AI Safety (EO) and the EU AI Act to identify riskier systems. A key conclusion of this essay is that compute thresholds, as currently implemented, are shortsighted and likely to fail to mitigate risk. The relationship between compute and risk is highly uncertain and rapidly changing. Relying upon compute thresholds overestimates our ability to predict what abilities emerge at different scales. This essay ends with recommendations for a better way forward.
