Clear, Compelling Arguments: Rethinking the Foundations of Frontier AI Safety Cases

Shaun Feakins, Ibrahim Habli, Phillip Morgan

TL;DR

Overall, this paper contributes holistic insights from the field of safety assurance via rigorous theory and methodologies that have been applied in safety-critical contexts.

Abstract

This paper contributes to the nascent debate around safety cases for frontier AI systems. Safety cases are structured, defensible arguments that a system is acceptably safe to deploy in a given context. Historically, they have been used in safety-critical industries, such as aerospace, nuclear or automotive. As a result, safety cases for frontier AI have risen in prominence, both in the safety policies of leading frontier developers and in international research agendas proposed by leaders in generative AI, such as the Singapore Consensus on Global AI Safety Research Priorities and the International AI Safety Report. This paper appraises this work. We note that research conducted within the alignment community which draws explicitly on lessons from the assurance community has significant limitations. We therefore aim to rethink existing approaches to alignment safety cases. We offer lessons from existing methodologies within safety assurance and outline the limitations involved in the alignment community's current approach. Building on this foundation, we present a case study for a safety case focused on Deceptive Alignment and CBRN capabilities, drawing on existing, theoretical safety case "sketches" created by the alignment safety case community. Overall, we contribute holistic insights from the field of safety assurance via rigorous theory and methodologies that have been applied in safety-critical contexts. We do so in order to create a better foundational framework for robust, defensible and useful safety case methodologies which can help to assure the safety of frontier AI systems.

Paper Structure (26 sections, 7 figures)


Figures (7)

  • Figure 1: GSN Argument Blocks [alexander_2008_engineering]
  • Figure 2: ACPs are used to indicate that a claim is accompanied by a confidence assertion. Uninstantiated and undeveloped elements indicate goals or evidence which need to be completed with a more concrete instance [goalstructuringnotationstandardworkinggroup_2011_gsn]
  • Figure 3: The full GSN can be traced from a top-level goal down to supporting evidence for relevant subgoals and strategies
  • Figure 4: The top-level goal is supported by examination of identified hazardous events. Each hazardous event is accompanied by supporting goals and evidence, which are derived from an earlier risk assessment and from risk reduction or evaluation methods. Evidence items in GSN, i.e. solutions, are not claims but references to data or results. We focus on two of many potentially hazardous events. This case study focuses firstly on Deceptive Alignment, a hazard which is defined with less certainty.
  • Figure 5: Control of the hazardous event is supported by arguments over through-life controls and mitigations, split here into development, deployment and post-deployment. Each goal is then supported by evidence
  • ...and 2 more figures
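
The figure captions describe a GSN argument as a tree that runs from a top-level goal, through strategies and subgoals, down to solutions (evidence references), with some elements left undeveloped. A minimal sketch of that structure is given below, assuming a simplified three-kind model (goal, strategy, solution). The `Node` class, the `undeveloped` helper and all node wording are hypothetical illustrations and are not taken from the paper's diagrams.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One GSN element; kind is 'goal', 'strategy' or 'solution' (simplified)."""
    kind: str
    text: str
    children: List["Node"] = field(default_factory=list)


def undeveloped(node: Node) -> List[str]:
    """List leaf elements that are not solutions, in depth-first order."""
    if not node.children:
        return [] if node.kind == "solution" else [node.text]
    found: List[str] = []
    for child in node.children:
        found.extend(undeveloped(child))
    return found


# Illustrative wording only (an assumption), not the paper's diagram text.
case = Node("goal", "System is acceptably safe to deploy in its context", [
    Node("strategy", "Argue over identified hazardous events", [
        Node("goal", "Deceptive Alignment is controlled", [
            Node("strategy", "Argue over through-life controls", [
                Node("goal", "Development controls are adequate", [
                    Node("solution", "Reference to evaluation results"),
                ]),
                Node("goal", "Deployment controls are adequate", [
                    Node("solution", "Reference to monitoring data"),
                ]),
                # Left without evidence to show an undeveloped goal.
                Node("goal", "Post-deployment controls are adequate"),
            ]),
        ]),
        # Second hazardous event, also left undeveloped in this sketch.
        Node("goal", "CBRN capability hazard is controlled"),
    ]),
])

print(undeveloped(case))
```

Walking the tree in this way reports every goal or strategy that is not yet traced down to a solution, which mirrors the role that undeveloped elements play in the notation.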