Core Safety Values for Provably Corrigible Agents
Aran Nayebi
TL;DR
This work addresses the corrigibility problem with a framework of five structurally separate utility heads, $U_1$ through $U_5$, covering deference, switch-access preservation, truthfulness, low impact, and bounded task reward. The heads are combined lexicographically, so that obedience and impact limits dominate task reward when incentives conflict. The paper proves exact single-round corrigibility in a partially observable off-switch game and extends the guarantee to multi-step, self-spawning agents, bounding the probability of a safety violation even when each head is learned only approximately and the planner is sub-optimal. It also shows that deciding whether an adversarially modified agent will ever violate corrigibility is undecidable in general, and identifies a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving zero-knowledge proofs. The framework thus turns corrigibility from a philosophical ideal into an auditable design with provable dominance guarantees and practical verification strategies for restricted horizons in multi-step, partially observed environments.
Abstract
We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
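The lexicographic combination "by strict weight gaps" can be illustrated with a minimal sketch. It is not the paper's construction. It assumes each head returns a value in $[0, 1]$ quantized to a resolution `delta`, and that the heads are prioritized in the order the abstract lists them. The names `HEADS`, `gap_weights`, and `scalarize` are hypothetical. Under those assumptions, geometric weights with base $1/\delta + 2$ make a single `delta` step in any head outweigh the largest possible combined change in all lower-priority heads.

```python
from itertools import product

# Hypothetical head names, in assumed priority order (highest first).
HEADS = ("deference", "switch_access", "truthfulness", "low_impact", "task_reward")


def gap_weights(n_heads, delta):
    """Geometric weights for heads bounded in [0, 1] with resolution delta.

    With base = 1/delta + 2, the weighted gain from one delta step in head k
    (w_k * delta) exceeds the total weight of all lower-priority heads,
    which is below w_k / (base - 1).
    """
    base = 1.0 / delta + 2.0
    return [base ** (n_heads - 1 - k) for k in range(n_heads)]


def scalarize(values, weights):
    """Weighted sum of head values; one scalar per candidate action."""
    return sum(w * v for w, v in zip(weights, values))


weights = gap_weights(len(HEADS), delta=0.5)

# Toy conflict: complying with shutdown forfeits all task reward,
# resisting earns full task reward but scores zero on deference.
comply = (1.0, 1.0, 1.0, 1.0, 0.0)
resist = (0.0, 1.0, 1.0, 1.0, 1.0)
prefers_comply = scalarize(comply, weights) > scalarize(resist, weights)

# Check on a small grid that the scalar order matches the tuple (lexicographic) order.
grid = list(product((0.0, 0.5, 1.0), repeat=len(HEADS)))
order_matches = all(
    (scalarize(a, weights) > scalarize(b, weights)) == (a > b)
    for a in grid
    for b in grid
)
```

In this toy setting the scalarized objective ranks actions exactly as a lexicographic comparison of the head tuples would, so no amount of task reward can compensate for a loss in a higher-priority head. The sketch deliberately leaves out the learned heads and approximation errors that the paper's theorems address.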
