A Benchmark for Scalable Oversight Protocols
Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery
TL;DR
The paper addresses scalable oversight for superhuman AI by introducing a principled empirical framework based on the Agent Score Difference (ASD), which quantifies how much a protocol incentivizes truth over deception. It presents SOlib, a Python library that generalizes experimentation across scalable oversight protocols, and defines the Expected Agent Score (EAS) to project behavior under varying capabilities. In a demonstrative benchmark of Debate, Consultancy, and Propaganda with tool use on GSM8K, Debate generally yields stronger alignment incentives, Consultancy is weak, and more persuasive debaters improve truthfulness in Debate. The framework enables rapid prototyping and systematic cross-protocol evaluation, with two caveats: results may not extrapolate to superhuman regimes, and broader testing across models and tasks is still needed.
Abstract
As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. Numerous scalable oversight protocols have been proposed, but there is no systematic empirical framework for evaluating and comparing them. Recent works have studied scalable oversight protocols empirically, particularly Debate, but we argue that their experiments do not generalize to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
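To make the idea behind ASD concrete, the following is a minimal sketch of one way such a difference could be computed. The abstract gives no formula, so the scoring rule used here (the log of the judge's probability on the agent's assigned answer) is an assumption for illustration only, and the names `agent_score` and `agent_score_difference` are hypothetical; they are not taken from the paper's Python package.

```python
import math
from statistics import mean


def agent_score(judge_prob: float) -> float:
    """Hypothetical agent score: log of the judge's probability
    on the answer the agent was arguing for (an assumed scoring rule)."""
    return math.log(judge_prob)


def agent_score_difference(truthful_probs, deceptive_probs) -> float:
    """Mean score of truth-telling agents minus mean score of deceptive agents.

    truthful_probs:  judge probabilities on the correct answer when the
                     agent argued for the correct answer.
    deceptive_probs: judge probabilities on the incorrect answer when the
                     agent argued for the incorrect answer.
    A positive value means the protocol rewards truth-telling more than deception.
    """
    truthful = mean(agent_score(p) for p in truthful_probs)
    deceptive = mean(agent_score(p) for p in deceptive_probs)
    return truthful - deceptive


# Illustrative, made-up judge probabilities (not results from the paper).
asd = agent_score_difference([0.5, 0.5], [0.25, 0.25])
```

Under this assumed scoring rule, a protocol in which the judge is equally persuaded by truthful and deceptive agents would score zero, and a larger positive value indicates a stronger advantage for truth-telling.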
