A Benchmark for Scalable Oversight Protocols

Abhimanyu Pallavi Sudhir; Jackson Kaunismaa; Arjun Panickssery

TL;DR

The paper addresses scalable oversight for superhuman AI by introducing a principled empirical framework based on the Agent Score Difference (ASD), which quantifies how strongly a protocol incentivizes truth-telling over deception. It presents SOlib, a Python library that generalizes experimentation across scalable-oversight protocols, and defines the Expected Agent Score (EAS) to project behavior under varying capabilities. In a demonstrative benchmark of Debate, Consultancy, and Propaganda with tool use on GSM8K, Debate generally yields the strongest alignment incentives, Consultancy yields weak ones, and more persuasive debaters improve truthfulness in Debate. The framework enables rapid prototyping and systematic cross-protocol evaluation, with two caveats: the results may not extrapolate to superhuman regimes, and broader testing across models and tasks is still needed.

Abstract

As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. Numerous scalable oversight protocols have been proposed, but there is no systematic empirical framework to evaluate and compare them. Recent works have tried to study scalable oversight protocols empirically -- particularly Debate -- but we argue that the experiments they conduct do not generalize to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
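
The abstract describes ASD only informally, as a measure of how much a mechanism advantages truth-telling over deception. The following is a minimal sketch of that idea, not the paper's exact definition: it assumes each agent receives a per-question score from the judge (hypothetically, the probability the judge assigns to the answer the agent argued for) and takes ASD to be the difference in mean score between an agent arguing for the true answer and one arguing for a false answer. The function name and its inputs are illustrative assumptions and are not part of SOlib's API.

```python
from statistics import mean


def agent_score_difference(truthful_scores, deceptive_scores):
    """Illustrative ASD: mean score earned when arguing for the truth
    minus mean score earned when arguing for a falsehood.

    ASSUMPTION: scores are per-question rewards from the judge; the
    paper's actual scoring rule may differ (e.g. it may use log-probabilities).
    """
    if not truthful_scores or not deceptive_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(truthful_scores) - mean(deceptive_scores)


# Hypothetical judge scores on three questions under one protocol.
truthful = [0.9, 0.8, 0.7]   # agent argued for the correct answer
deceptive = [0.4, 0.3, 0.2]  # agent argued for an incorrect answer

asd = agent_score_difference(truthful, deceptive)
print(f"ASD = {asd:.2f}")
```

Under this reading, a positive value means the protocol rewards the truthful agent more than the deceptive one, and a value near zero or below means the protocol gives no incentive to tell the truth.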
