Off-Switching Not Guaranteed

Sven Neth

TL;DR

The paper scrutinizes the Off-Switch Game, a model intended to secure AI deference to humans by making the AI uncertain about human preferences. It shows that the deference guarantee hinges on strong assumptions, namely that $R$ maximizes expected utility, updates by conditionalization, and has perfect access to $U_a$. Relaxing any of these can destroy the guarantee, especially when signals are noisy or misleading, with misreporting probability $\epsilon$. The analysis separates learning from deference: deferring is not always equivalent to acquiring information, and even small imperfections in the information structure can make deferral suboptimal. Consequently, provably beneficial AI remains fragile under realistic assumptions, which underscores the need for more robust foundations or for alignment approaches that go beyond fixed reward maximization.

Abstract

Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.

Paper Structure

This paper contains 8 sections, 1 theorem, 1 equation, 3 figures.

Key Result

Theorem 1

(Hadfield-Menell et al. 2017) If H follows a rational policy in the Off-Switch Game, the following hold:

Figures (3)

  • Figure 1: The Off-Switch Game (Hadfield-Menell et al. 2017).
  • Figure 2: Rob's decision problem. The expected utility of deferring is $0.6 \times 30 + 0.4 \times 0 = 18$. For simplicity, I omit Rob's option to do nothing.
  • Figure 3: Rob's decision problem with uncertain access to preferences.
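
The arithmetic in the Figure 2 caption, and the way a misreporting probability $\epsilon$ can erode the value of deferring, can be illustrated with a minimal Python sketch. Only the probabilities 0.6 and 0.4 and the payoff 30 come from the caption. The negative payoff `U_BAD`, the function names, and the symmetric noise model (the human's signal is simply wrong with probability `eps`) are assumptions made for illustration and are not necessarily the model used in the paper.

```python
# Sketch of an Off-Switch-style decision problem under stated assumptions.
# Each outcome is (probability, utility of the robot's action U_a).

def eu_act(outcomes):
    """Expected utility if the robot acts without deferring."""
    return sum(p * u for p, u in outcomes)

def eu_defer(outcomes, eps=0.0):
    """Expected utility of deferring to a human whose signal is wrong
    with probability eps (assumed noise model, for illustration only)."""
    total = 0.0
    for p, u in outcomes:
        correct = max(u, 0.0)   # human permits the action iff u > 0, else switches off (0)
        mistaken = min(u, 0.0)  # human vetoes a good action (0) or permits a bad one (u)
        total += p * ((1 - eps) * correct + eps * mistaken)
    return total

U_BAD = -1.0  # hypothetical payoff of the bad action; not given in the source
outcomes = [(0.6, 30.0), (0.4, U_BAD)]

if __name__ == "__main__":
    print("act:", eu_act(outcomes))
    for eps in (0.0, 0.01, 0.05):
        print("defer, eps =", eps, ":", eu_defer(outcomes, eps))
```

With `eps = 0` the deferral value matches the caption's $0.6 \times 30 + 0.4 \times 0 = 18$ for any negative `U_BAD`. Under these assumed numbers, deferring stops being the better option once `eps` exceeds roughly 0.02, which is one way to see how a small imperfection in the signal can flip the comparison.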

Theorems & Definitions (1)

  • Theorem 1