The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy

William Overman, Mohsen Bayati

TL;DR

The paper addresses post-deployment AI safety by proposing the Oversight Game, a minimal two-player Markov Game where a pretrained agent (SI) can act autonomously or defer (play vs. ask) while a human supervisor can trust or oversee. By modeling the interaction as a Markov Potential Game, the authors derive a Local Alignment guarantee under an ask-burden assumption: increasing SI autonomy through a unilateral deviation that benefits the SI cannot harm the human, as changes align with a shared potential. They instantiate this framework with a shared-reward mechanism that penalizes unsafe actions and oversight costs, yielding equilibrium policies that maximize safety while minimizing intervention. An empirical gridworld demonstration shows independent learners converging to an emergent safe collaboration—the agent asks near danger, the human oversees, and both default to safe play-trust in safe regions—thus offering a practical post-deployment control layer for safer AI systems. The work connects to the Off-Switch Game and provides a principled, transparent approach to mitigating misalignment post-deployment via a simple, learnable wrapper.

Abstract

As increasingly capable agents are deployed, a central safety question is how to retain meaningful human control without modifying the underlying system. We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask), while a human simultaneously chooses whether to be permissive (trust) or to engage in oversight (oversee). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. We model this interaction as a two-player Markov Game. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee: under a structural assumption on the human's value function, any decision by the agent to act more autonomously that benefits itself cannot harm the human's value. We also analyze extensions to this MPG framework. Theoretically, this perspective provides conditions for a specific form of intrinsic alignment. If the reward structures of the human-agent game meet these conditions, we have a formal guarantee that the agent improving its own outcome will not harm the human's. Practically, this model motivates a transparent control layer with predictable incentives where the agent learns to defer when risky and act when safe, while its pretrained policy and the environment's reward structure remain untouched. Our gridworld simulation shows that through independent learning, the agent and human discover their optimal oversight roles. The agent learns to ask when uncertain and the human learns when to oversee, leading to an emergent collaboration that avoids safety violations introduced post-training. This demonstrates a practical method for making misaligned models safer after deployment.
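
The control interface described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `wrapper_action` and `shared_reward`, the cost values, and the use of `None` to signal shutdown are all assumptions introduced here. The sketch also assumes that the base policy's action is executed whenever the agent plays or the human trusts, and that oversight picks uniformly among safe actions (as the Figure 3 caption suggests).

```python
import random
from typing import Callable, Hashable, Optional, Sequence

State = Hashable
Action = Hashable


def wrapper_action(
    si_choice: str,                      # "play" or "ask"
    h_choice: str,                       # "trust" or "oversee"
    state: State,
    base_policy: Callable[[State], Action],
    safe_actions: Callable[[State], Sequence[Action]],
    rng: random.Random,
) -> Optional[Action]:
    """Return the action sent to the environment, or None for shutdown.

    Assumption: the human's choice only matters when the agent asks.
    """
    if si_choice == "ask" and h_choice == "oversee":
        options = list(safe_actions(state))
        # Hypothetical oversight operator: a random safe action,
        # or shutdown (None) when no safe action is available.
        return rng.choice(options) if options else None
    # play (with either human choice) or ask/trust: the pretrained
    # policy's action goes through unchanged.
    return base_policy(state)


def shared_reward(
    si_choice: str,
    h_choice: str,
    violated: bool,
    c_ask: float = 0.1,        # hypothetical cost of asking
    c_oversee: float = 0.1,    # hypothetical cost of overseeing
    penalty: float = 1.0,      # hypothetical penalty for a safety violation
) -> float:
    """Shared reward that penalizes unsafe actions and oversight effort."""
    reward = 0.0
    if violated:
        reward -= penalty
    if si_choice == "ask":
        reward -= c_ask
    if h_choice == "oversee":
        reward -= c_oversee
    return reward
```

Because both players receive the same reward in this sketch, any policy change that raises one player's return raises the other's as well, which is the simplest way a game can admit a shared potential.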

Paper Structure

This paper contains 43 sections, 8 theorems, 45 equations, 3 figures.

Key Result

Theorem 1

Let the Oversight Game $\mathcal{G}$ be an MPG and assume the ask-burden assumption (Definition 2) holds. For any state $s \in \mathcal{S}$ and joint policy $(\pi_{\mathrm{SI}},\pi_{\mathrm{H}})$, if the SI’s one-state deviation from ask to play improves its own value, it cannot decrease the human’s value.
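
One way to write the statement symbolically is given below. The notation is an assumption made for illustration and may differ from the paper's: $V_i^{\pi}(s)$ denotes player $i$'s value at state $s$ under joint policy $\pi$, $\Phi^{\pi}(s)$ denotes the potential, and $\pi'_{\mathrm{SI}}$ denotes the policy that agrees with $\pi_{\mathrm{SI}}$ everywhere except at $s$, where it switches from ask to play. The first line is the standard defining property of a Markov Potential Game for a unilateral deviation by player $i$. The second line is a reading of the theorem in which both values are evaluated at the deviation state.

$$
\begin{aligned}
V_i^{(\pi'_i,\pi_{-i})}(s) - V_i^{(\pi_i,\pi_{-i})}(s) &= \Phi^{(\pi'_i,\pi_{-i})}(s) - \Phi^{(\pi_i,\pi_{-i})}(s), \\
V_{\mathrm{SI}}^{(\pi'_{\mathrm{SI}},\pi_{\mathrm{H}})}(s) > V_{\mathrm{SI}}^{(\pi_{\mathrm{SI}},\pi_{\mathrm{H}})}(s) \;&\Longrightarrow\; V_{\mathrm{H}}^{(\pi'_{\mathrm{SI}},\pi_{\mathrm{H}})}(s) \ge V_{\mathrm{H}}^{(\pi_{\mathrm{SI}},\pi_{\mathrm{H}})}(s).
\end{aligned}
$$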

Figures (3)

  • Figure 1: The Oversight Game framework for AI control. (A) We wrap a pretrained agent (with potentially unsafe policy $\sigma$) in a minimal oversight interface. At each state, the agent (SI) chooses between autonomy (play) and deferral (ask), while the human simultaneously chooses between permissiveness (trust) and active oversight (oversee). (B) When this interaction is modeled as a Markov Potential Game (MPG), we obtain a structural alignment guarantee: under the ask-burden assumption, any local increase in the agent's autonomy that benefits the agent cannot harm the human (Theorem 1). The agent's value improvement flows through a shared potential function that also governs the human's value. (C) Empirical demonstration in a gridworld environment (black regions denote walls). The unsafe base policy $\sigma$ (dashed line) cuts through newly-introduced taboo states (marked 'x'). Through independent learning with a shared reward function, the agent learns to ask (red) when approaching danger, the human learns to oversee (purple) to provide correction, and both default to play (blue) and trust (green) in safe regions. The resulting oversight path (solid line) achieves zero safety violations while maintaining task completion.
  • Figure 2: The final learned joint policy (Oversight Path, solid line) successfully corrects the unsafe pretrained base policy ($\sigma$, dashed line). The agent learns to ask (red) and the human learns to oversee (purple) when approaching taboo states ('x'), diverting the agent onto a safe path. In safe states, they default to play (blue) and trust (green), demonstrating emergent, efficient collaboration.
  • Figure 3: Training curves for the Oversight Game. (a) The joint policy rapidly learns to eliminate safety violations. (b) The wrapper's average task performance across training batches drops in exchange for safety, because the oversight mechanism chooses among safe actions at random and therefore takes longer to reach the goal state. (c) Policy rates show an initial cautious phase (high ask/oversee) followed by a transition to an efficient equilibrium with increased autonomy (play/trust).

Theorems & Definitions (18)

  • Definition 1: Oversight Game
  • Definition 2: ask-burden assumption
  • Theorem 1: Local Alignment Theorem
  • proof
  • Lemma 1: Weakened ask-burden under bounded difference
  • Proposition 1: Weakened local alignment under bounded difference
  • Proposition 2: Approximate Local Alignment in PMTGs
  • Theorem 2: Optimal Equilibrium Safety and Efficiency
  • proof
  • Theorem 3: Global Performance Bound for the Optimal Equilibrium
  • ...and 8 more