Off-Switching Not Guaranteed
Sven Neth
TL;DR
The paper scrutinizes the Off-Switch Game, which was proposed as a way to make AI agents defer to humans by making them uncertain about human preferences. It shows that the deference guarantee hinges on strong assumptions ($R$ maximizes expected utility, updates by conditionalization, and has perfect access to $U_a$) and that relaxing them can destroy the guarantee, especially when signals are noisy or misleading, with misreporting probability $\epsilon$. The analysis separates learning from deference: deferring is not always equivalent to acquiring information, and even small imperfections in the information structure can make deferral suboptimal. Consequently, provably beneficial AI remains fragile under realistic assumptions, which underscores the need for more robust foundations or for alignment approaches that go beyond fixed reward maximization.
Abstract
Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.
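The second reason can be illustrated with a toy calculation. The sketch below is not the paper's model. It assumes the AI agent chooses among acting directly, switching itself off, and deferring to a human who decides whether the action proceeds; that the action is worth $U_a$ to the human and switching off is worth 0; and that the human makes the wrong call with a hypothetical error probability `eps`. The function name `value_of_options` and the two-point belief over $U_a$ are illustrative assumptions.

```python
def value_of_options(outcomes, eps):
    """Expected value of each option in a toy off-switch setting.

    outcomes: list of (probability, utility) pairs describing the agent's
              belief about U_a (an illustrative assumption).
    eps:      assumed probability that the human makes the wrong call.
    """
    # Acting directly yields the expected utility of the action.
    act = sum(p * u for p, u in outcomes)
    # Switching off is assumed to be worth 0.
    off = 0.0
    # Deferring: with probability 1 - eps the human allows the action
    # exactly when u > 0, yielding max(u, 0); with probability eps the
    # human does the opposite, yielding min(u, 0).
    defer = sum(
        p * ((1 - eps) * max(u, 0.0) + eps * min(u, 0.0))
        for p, u in outcomes
    )
    return {"act": act, "off": off, "defer": defer}


# Hypothetical belief: the action is probably good, occasionally bad.
belief = [(0.9, 1.0), (0.1, -1.0)]

for eps in (0.0, 0.2):
    print(eps, value_of_options(belief, eps))
```

Under these assumptions, deferring is at least as good as either alternative when `eps` is 0, because a reliable human filters out the bad outcome. With `eps` at 0.2, deferring is worth less than acting directly, so an expected-utility maximizer in this toy setting would not defer.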
