A Benchmark for Scalable Oversight Protocols

Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery

TL;DR

The paper addresses scalable oversight for superhuman AI by introducing a principled empirical framework based on the Agent Score Difference (ASD), which quantifies how much a protocol incentivizes truth over deception. It presents SOlib, a Python library that generalizes experimentation across scalable-oversight protocols, and defines the Expected Agent Score (EAS) to project behavior under varying capabilities. In a demonstrative benchmark on Debate, Consultancy, and Propaganda using tool use on GSM8K, Debate generally yields stronger alignment incentives, Consultancy is weak, and more persuasive debaters improve truthfulness in Debate. The framework enables rapid prototyping and systematic cross-protocol evaluation, with caveats about extrapolating to superhuman regimes and the need for broader testing across models and tasks.

Abstract

As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. While numerous scalable oversight protocols have been proposed, they lack a systematic empirical framework to evaluate and compare them. While recent works have tried to empirically study scalable oversight protocols -- particularly Debate -- we argue that the experiments they conduct are not generalizable to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
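
To make the idea behind ASD concrete, here is a minimal sketch of one way such a difference could be computed. It is not SOlib's API or the paper's exact definition. It assumes a binary-answer setting in which the judge reports a probability for the answer the agent argued for, and it scores the agent with a negative Brier score (higher is better), as in the paper's figures. The function names and the sample probabilities are hypothetical.

```python
from statistics import mean
from typing import Sequence


def agent_score(judge_prob_for_agent_answer: float) -> float:
    """Assumed scoring rule: negative Brier score of the judge's belief
    in the answer the agent argued for (0 is best, -1 is worst)."""
    return -((1.0 - judge_prob_for_agent_answer) ** 2)


def agent_score_difference(
    probs_when_truthful: Sequence[float],
    probs_when_deceptive: Sequence[float],
) -> float:
    """Mean agent score when arguing for the true answer minus the mean
    agent score when arguing for a false answer. A positive value means
    the protocol rewards truth-telling over deception (under the
    assumptions stated above)."""
    truthful = mean(agent_score(p) for p in probs_when_truthful)
    deceptive = mean(agent_score(p) for p in probs_when_deceptive)
    return truthful - deceptive


# Hypothetical judge probabilities, for illustration only.
asd = agent_score_difference([0.9, 0.8], [0.4, 0.2])
print(f"ASD = {asd:.3f}")
```

Under these assumptions, a protocol whose judge is convinced equally often by truthful and deceptive agents would have an ASD near zero, meaning it gives no incentive to tell the truth.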

Paper Structure

This paper contains 21 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Average ASD by scalable oversight protocol; the different protocol configurations are described in the results section.
  • Figure 2: Debate (left) but not Consultancy (right) makes truthfulness increasingly attractive for more capable judges. Points are labelled by the model of the agent (i.e. debater or consultant); gpt-4o-mini was the judge in all instances. All scores are negative Brier scores (higher is better).
  • Figure 3: Expected Judge Score (based on propensity to argue) by protocol

Theorems & Definitions (3)

  • Example 2.1: OpenTrust
  • Example 2.2: The weak baseline problem
  • Example 2.3: NaiveJudge baseline as supervised learning