The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy
William Overman, Mohsen Bayati
TL;DR
The paper addresses post-deployment AI safety by proposing the Oversight Game, a minimal two-player Markov Game in which a pretrained agent (SI) chooses to act autonomously (play) or defer (ask), while a human supervisor chooses to be permissive (trust) or to engage in oversight (oversee). By modeling the interaction as a Markov Potential Game, the authors derive a Local Alignment guarantee under an ask-burden assumption: a unilateral deviation that increases the SI's autonomy and benefits the SI cannot harm the human, since such changes align with a shared potential. They instantiate this framework with a shared-reward mechanism that penalizes unsafe actions and charges a cost for oversight, yielding equilibrium policies that maximize safety while minimizing intervention. An empirical gridworld demonstration shows independent learners converging to an emergent safe collaboration: near danger the agent asks and the human oversees, while in safe regions both default to play and trust. The result is a practical post-deployment control layer for safer AI systems. The work connects to the Off-Switch Game and provides a principled, transparent approach to mitigating misalignment post-deployment via a simple, learnable wrapper.
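For reference, the Markov Potential Game property that the guarantee builds on can be written as follows. This is the standard definition from the MPG literature, not a formula quoted from the paper; the symbols V, Φ, and π are assumed notation.

```latex
% Assumed notation: V_i^{s}(\pi) is player i's value from state s
% under the joint policy \pi = (\pi_i, \pi_{-i}); \Phi_s is the potential.
\exists\, \Phi_s \ \text{such that}\ \forall i,\ \forall s,\ \forall \pi_i,\ \pi_i',\ \pi_{-i}:
\qquad
V_i^{s}(\pi_i, \pi_{-i}) - V_i^{s}(\pi_i', \pi_{-i})
  \;=\; \Phi_s(\pi_i, \pi_{-i}) - \Phi_s(\pi_i', \pi_{-i})
```

Here i ranges over the two players (the SI and the human). The identity only describes how a player's own value changes under its own unilateral deviation, which is presumably why a statement about the human's value under an SI deviation needs the additional ask-burden assumption.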
Abstract
As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
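The control interface described above can be sketched as a small wrapper around the pretrained policy. This is a minimal illustration, not the authors' implementation: the names `resolve`, `base_policy`, `corrector`, and `SHUTDOWN` are hypothetical, and the rule that `play` always executes the pretrained action regardless of the human's move is an assumption inferred from the statement that the human's choice matters when the agent defers.

```python
from enum import Enum
from typing import Callable, Hashable


class AgentMove(Enum):
    PLAY = "play"  # act autonomously
    ASK = "ask"    # defer to the human


class HumanMove(Enum):
    TRUST = "trust"      # be permissive
    OVERSEE = "oversee"  # engage in oversight


# Hypothetical sentinel for the "system shutdown" outcome.
SHUTDOWN = "shutdown"


def resolve(
    state: Hashable,
    agent_move: AgentMove,
    human_move: HumanMove,
    base_policy: Callable[[Hashable], Hashable],
    corrector: Callable[[Hashable, Hashable], Hashable],
) -> Hashable:
    """Return the action that is actually executed in `state`.

    The pretrained `base_policy` is never modified; the wrapper only
    decides whether its proposed action passes through unchanged.
    """
    proposed = base_policy(state)
    # Assumption: when the agent plays, its proposal is executed as-is.
    if agent_move is AgentMove.PLAY:
        return proposed
    # The agent asked, so the human's simultaneous choice decides.
    if human_move is HumanMove.TRUST:
        return proposed
    # Oversight: a corrective action or SHUTDOWN, chosen by `corrector`.
    return corrector(state, proposed)
```

In a learning setup along the lines the abstract describes, each player would independently learn a state-dependent policy over its own two moves, while `base_policy` and the environment's reward stay fixed.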
