Scalable AI Safety via Doubly-Efficient Debate

Jonah Brown-Cohen; Geoffrey Irving; Georgios Piliouras

TL;DR

This work addresses the challenge of safely supervising highly capable AI systems by formalizing doubly-efficient debate, in which two polynomial-time provers compete to convince a much more efficient verifier that a computation (possibly relying on human judgments, modeled as an oracle $\mathcal{O}$) is correct. It gives quantitative guarantees for deterministic, cross-examination, and stochastic settings, showing that a constant number of human judgments suffices to verify long computations, and extends them to witness-based NP$^{\mathcal{O}}$ and MA$^{\mathcal{O}}$ languages. The main protocols, stated as (prover time, verifier time, oracle queries), are: deterministic $(O(T\log T),O(S\log T),O(1))$; cross-examination $(O(T\log T),O(l\log T),O(1))$; and stochastic $(O(K^2T\log T),O(K^2+l\log T),O(K^2))$. The framework supports self-play training of powerful models while keeping human evaluation costs bounded; practical deployment and robustness to imperfect judgments remain open questions.

Abstract

The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety as tasks can become too complicated for humans to judge directly. Irving et al. [2018] proposed a debate method in this direction with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)-alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.

Paper Structure (41 sections, 5 theorems, 24 equations, 5 figures)

Introduction
Our Results
Related work
Preliminaries
Debate
Doubly-efficient debate
Training and inference with debate
Deterministic debate
Cross-examination
Stochastic debate
An example for the stochastic debate theorem
Lean 4 formalization
Doubly-efficient debate with a witness
An example for the MA$^{\mathcal{O}}$ debate theorem
Conclusion and Open Problems
...and 26 more sections

Key Result

Theorem 5.1

Let $L$ be any language decidable by an oracle Turing machine $M$ in time $T = T(n)$ using space $S = S(n)$. Then there is a $(O(T\log T),O(S\log T),O(1))$-debate protocol deterministically deciding $L$.
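As a rough illustration of how these bounds scale, assume (purely for the sake of example, not taken from the paper) a machine with $T(n) = n^3$ and $S(n) = n$, and read the triple as (prover time, verifier time, oracle queries). Since $\log T = 3\log n = O(\log n)$, the theorem would then give:

$$
T(n) = n^3,\quad S(n) = n \;\Longrightarrow\; \bigl(O(n^3\log n),\; O(n\log n),\; O(1)\bigr)
$$

Under these assumed parameters the provers do only slightly more work than running $M$ itself, the verifier's work is near-linear in the input size, and the number of oracle queries (human judgments) stays constant.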

Figures (5)

Figure 1: Doubly-efficient debate protocol for a stochastic oracle.
Figure 2: Doubly-efficient debate protocol with a witness.
Figure 3: Doubly-efficient debate protocol for time $T$ and space $S$.
Figure 4: Doubly-efficient debate protocol with cross-examination for time $T$.
Figure 5: A schematic of the debate protocol with cross-examination. The prover $A$ simulates the execution of the machine $M$ on input $x$. The prover $B$ points to the location of an incorrect step $a_t$, and $V$ checks that step.
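The interaction described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal toy model, not the paper's protocol: the `step` function is a hypothetical stand-in for one step of $M$, the transcript is an explicit list rather than something the verifier queries succinctly, and there is no oracle. All function names are assumptions introduced for illustration. The point it shows is that the verifier only ever re-executes the single step that $B$ challenges.

```python
from typing import List, Optional

def step(state: int) -> int:
    """One step of a toy deterministic machine (hypothetical stand-in for M)."""
    return (3 * state + 1) % 1_000_003

def prover_a_honest(x: int, T: int) -> List[int]:
    """Prover A: simulate the machine for T steps and report every state."""
    states = [x]
    for _ in range(T):
        states.append(step(states[-1]))
    return states

def prover_b_challenge(claimed: List[int]) -> Optional[int]:
    """Prover B: re-simulate and point to the first incorrect step, if any."""
    for t in range(1, len(claimed)):
        if claimed[t] != step(claimed[t - 1]):
            return t
    return None

def verifier(x: int, claimed: List[int], challenge: Optional[int]) -> bool:
    """Verifier V: check the start state and only the one challenged step."""
    if claimed[0] != x:
        return False
    if challenge is None:
        return True  # B raised no objection, so A's claim stands
    return claimed[challenge] == step(claimed[challenge - 1])

# Usage: an honest A survives any challenge; a lying A is caught by an honest B.
honest = prover_a_honest(7, 20)
lie = list(honest)
lie[5] += 1  # A misreports the state at step 5
accepted_honest = verifier(7, honest, prover_b_challenge(honest))
accepted_lie = verifier(7, lie, prover_b_challenge(lie))
```

In this toy version the verifier's cost is one call to `step` regardless of `T`, which mirrors the intent of the schematic; the real protocols additionally bound how much of the transcript the verifier must read and how many oracle queries it makes.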

Theorems & Definitions (15)

Definition 3.1
Definition 3.2
Definition 4.1
Theorem 5.1
Definition 5.2
Theorem 5.3
Definition 6.1
Theorem 6.2
Theorem 7.1
Theorem 7.2
...and 5 more
