Off-Switching Not Guaranteed

Sven Neth

TL;DR

The paper scrutinizes the Off-Switch Game, a model proposed for securing AI deference to humans by making the AI uncertain about human preferences. It shows that the deference guarantee hinges on strong assumptions (that $R$ maximizes expected utility, updates by conditionalization, and has perfect access to $U_a$) and that relaxing any of them can destroy the guarantee, especially when signals are noisy or misleading with misreporting probability $\epsilon$. The analysis separates learning from deference: deferring is not always equivalent to acquiring information, and even small imperfections in the information structure can make deferral suboptimal. Provably beneficial AI is therefore fragile under realistic assumptions, which underscores the need for more robust foundations or for alignment approaches that go beyond fixed reward maximization.

Abstract

Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.
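The second point can be illustrated with a toy calculation. The sketch below is not the paper's formal model. It assumes a two-outcome prior over the action's utility and a human whose allow/block decision is flipped with probability `eps`, a stand-in for a misreporting probability $\epsilon$. The function names and the numbers are hypothetical and chosen only for illustration.

```python
# Toy Off-Switch comparison (illustrative assumptions, not the paper's model).
# The AI can act immediately, switch itself off (value 0), or defer to a human.
# A perfectly informed human allows the action exactly when its utility is
# positive. Assumption: with probability eps the human's decision is flipped.

def act_value(outcomes):
    """Expected utility of acting without deferring."""
    return sum(p * u for p, u in outcomes)

def defer_value(outcomes, eps):
    """Expected utility of deferring when the human errs with probability eps."""
    total = 0.0
    for p, u in outcomes:
        # Probability that the human lets the action go ahead.
        p_allow = (1 - eps) if u > 0 else eps
        total += p * p_allow * u  # a blocked action yields utility 0
    return total

# Hypothetical prior: the action is very likely good and mildly bad otherwise.
outcomes = [(0.99, 10.0), (0.01, -1.0)]

for eps in (0.0, 0.001, 0.01):
    best_alternative = max(act_value(outcomes), 0.0)
    print(eps, defer_value(outcomes, eps), best_alternative)
```

With `eps = 0` deferring is at least as good as acting or switching off, which mirrors the original deference result. With this particular prior, an error rate of one percent is already enough for acting directly to have higher expected utility than deferring.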
