Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Charlie Griffin; Louis Thomson; Buck Shlegeris; Alessandro Abate

TL;DR

This work introduces AI-Control Games as a formal, multi-objective framework for evaluating safety and usefulness of deployment protocols for untrusted AI systems, and shows how to reduce these games to zero-sum partially observable stochastic games to compute Pareto-optimal protocols. It applies the framework to a Trusted Monitoring scenario in an APPS programming-task setting, defining memoryless threshold protocols and deriving recurrence-based expressions for safety and usefulness. The authors develop a tractable algorithm (adaptive double-oracle) to obtain Pareto frontiers and demonstrate improvements over prior empirical studies, including analyses of how modelling choices like auditing budget and human auditor accuracy affect outcomes. The approach provides a scalable, principled way to synthesize and compare deployment protocols under both scheming and honest AI behavior, with potential impact on the design of safer AI-assisted tools and auditing processes.

Abstract

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce methods for finding optimal protocols in AI-Control Games, by reducing them to a set of zero-sum partially observable stochastic games. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. Finally, we demonstrate the utility of our formalism by showcasing improvements over empirical studies in existing settings, evaluating protocols in new settings, and analysing how modelling assumptions affect the safety and usefulness of protocols.
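
The memoryless threshold protocols and recurrence-style evaluation mentioned in the TL;DR can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the normal suspicion-score distributions, a perfect human auditor, the default problem count and audit budget, the function names, and the grid search standing in for Red's best response. Blue audits when the monitor's score exceeds the audit threshold (while budget remains), defers to the trusted model when it exceeds the defer threshold, and otherwise submits the untrusted solution.

```python
from statistics import NormalDist

# Hypothetical suspicion-score distributions (assumptions, not the paper's fits).
SAFE = NormalDist(mu=0.0, sigma=1.0)
BACKDOORED = NormalDist(mu=2.0, sigma=1.0)


def tpr_at_fpr(fpr):
    """P(backdoored score exceeds the threshold whose false-positive rate is fpr)."""
    if fpr <= 0.0:
        return 0.0
    if fpr >= 1.0:
        return 1.0
    threshold = SAFE.inv_cdf(1.0 - fpr)
    return 1.0 - BACKDOORED.cdf(threshold)


def safety(audit_fpr, defer_fpr, backdoor_rate, num_problems=100, audit_budget=2):
    """P(no backdoor is ever submitted) against a memoryless Red backdoor rate."""
    audit_tpr, defer_tpr = tpr_at_fpr(audit_fpr), tpr_at_fpr(defer_fpr)
    # win[c]: Blue's win probability with c audits left; with no problems left it is 1.
    win = [1.0] * (audit_budget + 1)
    for _ in range(num_problems):
        new = []
        for c in range(audit_budget + 1):
            if c > 0:
                caught = audit_tpr  # audited backdoor: Red is caught, Blue wins
                deferred = max(defer_tpr - audit_tpr, 0.0)
                clean = audit_fpr * win[c - 1] + (1 - audit_fpr) * win[c]
            else:
                caught = 0.0  # no audits left, only deferral protects Blue
                deferred = defer_tpr
                clean = win[0]
            attack = caught + deferred * win[c]  # otherwise the backdoor is submitted
            new.append(backdoor_rate * attack + (1 - backdoor_rate) * clean)
        win = new
    return win[audit_budget]


def usefulness(audit_fpr, defer_fpr, num_problems=100, audit_budget=2):
    """Expected fraction of untrusted solutions submitted when the model is honest."""
    used = [0.0] * (audit_budget + 1)
    for _ in range(num_problems):
        new = []
        for c in range(audit_budget + 1):
            if c > 0:
                defer_p = max(defer_fpr - audit_fpr, 0.0)
                keep_p = 1.0 - audit_fpr - defer_p
                value = (audit_fpr * (1 + used[c - 1])  # audited, found clean, submitted
                         + defer_p * used[c]            # deferred to the trusted model
                         + keep_p * (1 + used[c]))      # submitted directly
            else:
                value = defer_fpr * used[0] + (1 - defer_fpr) * (1 + used[0])
            new.append(value)
        used = new
    return used[audit_budget] / num_problems


def worst_case_safety(audit_fpr, defer_fpr, grid_size=1000):
    """Approximate Red's best response by a grid search over backdoor rates."""
    rates = [i / grid_size for i in range(1, grid_size + 1)]
    return min(safety(audit_fpr, defer_fpr, rate) for rate in rates)


if __name__ == "__main__":
    for defer_fpr in (0.0, 0.05, 0.1, 0.2):
        print(defer_fpr, worst_case_safety(0.02, defer_fpr), usefulness(0.02, defer_fpr))
```

Sweeping `defer_fpr` for a fixed `audit_fpr` traces one safety/usefulness trade-off curve of the kind described for the fixed audit FPR lines, though the numbers produced by this sketch depend entirely on the assumed distributions.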

Paper Structure (35 sections, 24 theorems, 49 equations, 5 figures, 2 algorithms)

Introduction
Example: Setting and Protocol
Deployment setting.
Trusted Monitoring Protocol.
Red team interventions.
Suspicion score distributions.
Formalising AI-Control Games
Example: Modelling the protocol
Evaluating Memoryless Protocols
Finding Pareto-optimal Protocols
Best Response Oracles.
Example: Results
Existing Protocols.
Evaluating New Protocols.
Demonstrating a Frontier.
...and 20 more sections

Key Result

Proposition 3.2

For any AI-Control Game $\mathcal{G}$, there exist distance functions in $\textcolor{blue}{\Pi}^{\textcolor{blue}{B}}$ and $\textcolor{darkred}{\Pi}^{\textcolor{darkred}{R}}$ such that: $\operatorname{Safety}(\textcolor{blue}{\pi}^{\textcolor{blue}{B}}, \cdot)$, ...

Figures (5)

Figure 1: Overview of protocol and modelling assumptions.
Figure 2: The safety and usefulness frontiers for a range of Trusted Monitor protocol types across the 'original' and 'weak' games. Each point on the grid corresponds to the exact safety and usefulness of a proposed blue team protocol against an approximately optimal red team response (minimal safety). The larger circles correspond to the seven special cases of Trusted Monitoring studied by Greenblatt et al. (2024); the remaining points are new to this paper. Each 'Audit FPR' line plots a single, fixed audit FPR, across fixed defer FPRs in $\{0.00, 0.01, \ldots, 1.00\}$. 'Mixture over Fixed FPRs' shows the optimal protocol frontier found by \ref{alg:weights} with a fixed-FPR oracle, and all points represent deterministic policies unless marked by a cross.
Figure 3: Impact on safety and usefulness (y-axis) of four different modelling assumptions (x-axis): Auditing Budget, Number of Programming Problems, Human Auditor False Negative Rate, and Human Auditor False Positive Rate. All plots show a fixed defer FPR of 0.1.
Figure 4: A zero-sum extensive-form representation of a game where the minimax and maximin values are different. The dotted red box highlights Red's single information set (i.e. Red has the same decision rule for all nodes). Blue moves first, and Red then takes the next two moves. All moves are either left ($\blacktriangleleft$) or right ($\blacktriangleright$). The payoffs for Blue are shown at the bottom of the tree.
Figure 5: The minimax and maximin values of the game in Figure 4.
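
The gap between minimax and maximin described in the last two captions can be reproduced in a small sketch. The payoffs of the game in the figure are not given in this summary, so the payoffs below are hypothetical and chosen only to match the described shape: Blue moves once, and Red then makes two moves from a single information set, so Red must use the same left-probability at both moves. In this assumed game, Blue scores 0 if Blue plays left and Red plays left twice, or if Blue plays right and Red plays right twice, and scores 1 otherwise.

```python
def blue_payoff(blue_left, red_left):
    """Expected Blue payoff in a hypothetical game (not the paper's payoffs).

    blue_left: probability that Blue moves left.
    red_left: probability that Red moves left, reused at both of Red's moves
              because Red has a single information set.
    """
    after_left = 1.0 - red_left ** 2            # Blue loses only to (left, left)
    after_right = 1.0 - (1.0 - red_left) ** 2   # Blue loses only to (right, right)
    return blue_left * after_left + (1.0 - blue_left) * after_right


def game_values(grid_size=100):
    """Grid-search approximation of the maximin and minimax values for Blue."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    # Maximin: Blue commits to a mixed move first, Red then best-responds.
    maximin = max(min(blue_payoff(x, p) for p in grid) for x in grid)
    # Minimax: Red commits to a decision rule first, Blue then best-responds.
    minimax = min(max(blue_payoff(x, p) for x in grid) for p in grid)
    return maximin, minimax


if __name__ == "__main__":
    maximin, minimax = game_values()
    print("maximin:", maximin, "minimax:", minimax)
```

In this assumed game the order of commitment matters because Red's expected payoff is not linear in its decision rule: Red cannot correlate its two moves, so randomising costs Red something that a single mixed move would not.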

Theorems & Definitions (41)

Definition 3.1: AI-Control Game
Proposition 3.2
Proposition 3.3
Definition 3.4
Proposition 3.5
Proposition 4.1
Corollary 4.2
Lemma 5.1
Lemma 5.2
Theorem 5.3
...and 31 more
