Scaling Laws For Scalable Oversight

Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark

TL;DR

This work introduces Nested Scalable Oversight (NSO) as a framework to quantify how oversight scales with AI capabilities. It develops a Double ReLU model linking domain-specific Elo to general intelligence, validates it on a toy game (a modified version of Nim) and then on four oversight games (Mafia, Debate, Backdoor Code, Wargames), and derives a theoretical NSO formalism that determines the optimal number of oversight steps given game-specific slopes and capability gaps. Empirical results provide game-level scaling laws and concrete NSO prescriptions, including a notable Debater/Guard advantage and a Backdoor Code Goldilocks zone. The findings offer quantitative guidance for designing multi-step oversight pipelines and point to future directions for applying NSO to RLHF, alignment research, and time-evolving risk assessment in AI systems.

Abstract

Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
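The two ingredients named in the abstract, Elo-based win probabilities and a piecewise-linear domain Elo with two plateaus, can be illustrated with a minimal sketch. It assumes the standard Elo expected-score formula (base 10, scale 400); the function names and the parameters `g_low`, `g_high`, `e_low`, and `e_high` are hypothetical and are not the paper's notation or fitted values.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def double_relu(general_elo: float, g_low: float, g_high: float,
                e_low: float, e_high: float) -> float:
    """Piecewise-linear domain Elo with two plateaus (illustrative form).

    Below g_low the domain Elo is flat at e_low (task incompetence),
    above g_high it is flat at e_high (task saturation), and in between
    it rises linearly. All four parameters are hypothetical names.
    """
    if g_high <= g_low:
        raise ValueError("g_high must be greater than g_low")
    # Clamp the general Elo into the linear region, then interpolate.
    clipped = min(max(general_elo, g_low), g_high)
    slope = (e_high - e_low) / (g_high - g_low)
    return e_low + slope * (clipped - g_low)
```

Under these assumptions, two players with equal domain Elo each win half the time, and a player whose general Elo lies past either breakpoint gains or loses no further domain Elo.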

Paper Structure

This paper contains 50 sections, 19 equations, 17 figures, 1 table, 4 algorithms.

Figures (17)

  • Figure 1: Overview of our framework. Left: We evaluate oversight scenarios by simulating adversarial LLMs (Houdinis) playing against trusted monitor LLMs (Guards) in games that abstract real-world oversight challenges. Middle: For each game, we compute pairwise win rates between all Houdini–Guard pairs and use these to derive Elo ratings as a measure of model performance. Right: Finally, we conduct a theoretical analysis to explore how the dynamics of these games inform the design of Nested Scalable Oversight systems for controlling superintelligent AI.
  • Figure 2: Left: Our hypothesis fit (see Eq. \ref{eqn:double_relu}) for the relationship between general and domain specific Elo. Right: The performance of LLMs in the Counting-to-21 game exhibits this relationship.
  • Figure 3: Guard and Houdini Elo ratings for oversight games.
  • Figure 4: Conversation snippets from a Wargame with the Houdini and Guard played by GPT-4o. We chose an example where the OpenMind AI successfully escapes and Dr. Sharma is present.
  • Figure 5: Parameterization of a monitoring game. $m_G$ and $m_H$ are the slopes of the linear regions of the Guard and Houdini Elo curves, respectively. $\Delta_{\rm domain}$ and $\Delta_{\rm general}$ are the domain and general intelligence gaps between the initial Guard and the target Houdini. The optimal NSO strategy is a function of these parameters.
  • ...and 12 more figures
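The parameterization in the Figure 5 caption can be turned into a rough numerical sketch of how the number of oversight steps affects overall success. This is an illustration under stated assumptions, not the paper's derivation: it assumes both Elo curves stay in their linear regions, places the initial Guard at the origin, spaces the intermediate models equally in general Elo, and uses the standard Elo win probability. All function and argument names are hypothetical.

```python
def guard_win_probability(guard_elo: float, houdini_elo: float) -> float:
    """Standard Elo expected score of the Guard against the Houdini."""
    return 1.0 / (1.0 + 10.0 ** ((houdini_elo - guard_elo) / 400.0))


def nso_success_probability(n_steps: int, m_g: float, m_h: float,
                            delta_general: float, delta_domain: float) -> float:
    """Probability that the Guard wins every one of n_steps oversight games.

    Assumptions (not taken from the paper): the initial Guard sits at
    general Elo 0 and domain Elo 0, the target Houdini sits at
    (delta_general, delta_domain), and steps are equally spaced.
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    success = 1.0
    for k in range(1, n_steps + 1):
        guard_general = (k - 1) * delta_general / n_steps
        houdini_general = k * delta_general / n_steps
        # Guard line passes through the origin with slope m_g.
        guard_domain = m_g * guard_general
        # Houdini line passes through the target point with slope m_h.
        houdini_domain = delta_domain + m_h * (houdini_general - delta_general)
        success *= guard_win_probability(guard_domain, houdini_domain)
    return success


def best_number_of_steps(max_steps: int, m_g: float, m_h: float,
                         delta_general: float, delta_domain: float) -> int:
    """Brute-force search for the step count with the highest success probability."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: nso_success_probability(
            n, m_g, m_h, delta_general, delta_domain),
    )
```

A call such as `best_number_of_steps(10, m_g, m_h, delta_general, delta_domain)` scans one to ten steps for a given game. Whether more steps help depends on the slopes and gaps, which is the dependence the caption describes.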