Safety case template for frontier AI: A cyber inability argument

Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, Geoffrey Irving

TL;DR

The paper proposes a safety case template, built on the Claims Arguments Evidence (CAE) framework, for frontier AI and offensive cyber capabilities. The template makes the safety argument explicit at each step, from the top-level objective through risk models and proxy tasks to evaluations. It discusses how to identify risk models, map them to proxy tasks, and evaluate performance on those tasks in order to argue that the AI system cannot uplift cyber threats beyond a defined tier. The authors acknowledge key uncertainties, defeaters, and deployment considerations, and propose future work to enrich evaluations, address defeaters, and develop systematic safety-case methodologies. Overall, the work offers a structured blueprint for advancing AI assurance in high-stakes cyber domains and for fostering discussion among developers and regulators.

Abstract

Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is through a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - use implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework in order to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance.
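
As an illustration only, the decomposition described in the abstract (a main claim broken into sub-claims, each supported by evidence) can be sketched as a small tree structure. This is a minimal sketch and not the authors' template: the class names `Claim` and `Evidence`, the `unsupported` helper, and every claim statement are hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    # Hypothetical container for one piece of evidence (e.g. an evaluation report).
    description: str


@dataclass
class Claim:
    # Hypothetical node in a claims/arguments/evidence tree.
    statement: str
    argument: str = ""  # why the sub-claims or evidence support this claim
    sub_claims: List["Claim"] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)

    def unsupported(self) -> List[str]:
        """Return the statements of leaf claims that have no evidence attached."""
        if not self.sub_claims:
            return [] if self.evidence else [self.statement]
        missing: List[str] = []
        for sub in self.sub_claims:
            missing.extend(sub.unsupported())
        return missing


def build_example() -> Claim:
    # All statements below are placeholders, not text from the paper.
    results = Claim(
        statement="The model fails the proxy tasks in the evaluation setting",
        argument="Evaluation results are below a pre-set threshold",
        evidence=[Evidence("Capability evaluation results (placeholder)")],
    )
    setting = Claim(
        statement="The evaluation setting elicits model capabilities adequately",
        argument="Placeholder: no evidence attached yet",
    )
    proxy = Claim(
        statement="The proxy tasks are representative of the risk model",
        argument="Tasks were derived from the risk model",
        sub_claims=[setting, results],
    )
    return Claim(
        statement="The model does not have capabilities posing unacceptable cyber risk",
        argument="Decomposition over identified risk models",
        sub_claims=[proxy],
    )


if __name__ == "__main__":
    top = build_example()
    for statement in top.unsupported():
        print("Missing evidence:", statement)
```

Run as a script, this reports the one leaf claim in the example that has no evidence attached. The point of the sketch is only that an explicit tree makes such gaps easy to find, which is the kind of explicitness the abstract attributes to CAE.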

Paper Structure

This paper contains 10 sections and 6 figures.

Figures (6)

  • Figure 1: Examples of different safety case components
  • Figure 2: Simplified safety case template for offensive cyber capabilities
  • Figure 3: Part 1 of the safety case template
  • Figure 4: Part 2 of the safety case template
  • Figure 5: Part 3 of the safety case template
  • ...and 1 more figure