Modeling Human Beliefs about AI Behavior for Scalable Oversight

Leon Lang; Patrick Forré

TL;DR

The paper tackles scalable oversight by formalizing human belief models, which describe how evaluators' imperfect beliefs shape their feedback about AI behavior. It introduces a mathematical framework in which a human belief model, a four-component tuple, represents the human's feature ontology and observation-belief and is linked to the return function $G$ through an associated return function (a subscripted $G$ in the paper); the key novelty is the analysis of the ambiguity that remains when recovering $G$ and of the conditions for completeness. The authors then introduce belief-model covering and morphisms to relax the need for precise modeling while preserving identifiability, showing that a complete covering model determines $G$ uniquely from that associated return function. They propose a practical path that uses adapted foundation models to construct covering belief models, with linear ontology translations and reward probes to learn $G$ for policy optimization. The work lays out theoretical guarantees, conceptual examples (including symmetry-invariant rewards), and a roadmap for empirical evaluation, highlighting how complete covering models could enable scalable oversight even when human evaluators misunderstand AI behavior. Overall, the theory offers a principled route to recovering correct human-aligned objectives from possibly faulty feedback, guiding future theory, empirical validation, and practical implementation in scalable AI safety.

Abstract

As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
