Bare Minimum Mitigations for Autonomous AI Development
Joshua Clymer, Isabella Duan, Chris Cundy, Yawen Duan, Fynn Heide, Chaochao Lu, Sören Mindermann, Conor McGurk, Xudong Pan, Saad Siddiqui, Jingren Wang, Min Yang, Xianyuan Zhan
TL;DR
The paper addresses the risk that AI R&D could be rapidly automated and accelerated, outpacing governance and safeguards. It synthesizes insights from a two-day, multi-stakeholder workshop to define two thresholds, Threshold One (automation of most internal R&D) and Threshold Two (dramatic, rapid improvement to catastrophic capabilities), and to propose four concrete recommendations aimed at frontier AI developers and governments. It details threat models for each threshold (safety sabotage, unauthorized deployment, adaptation lag, and capability proliferation) and prescribes monitoring indicators and security measures to mitigate risks before they escalate. The work offers a non-binding, action-oriented framework intended to guide policy and industry practice toward safer, more resilient development of frontier AI.
Abstract
Artificial intelligence (AI) is advancing rapidly and could significantly automate AI research and development (R&D) itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI R&D and suggested a red line: no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis of the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations that apply when AI agents significantly automate or accelerate AI development.
