On Corrigibility and Alignment in Multi Agent Games

Edmund Dable-Heath; Boyko Vodenicharski; James Bishop

TL;DR

The paper addresses corrigibility and alignment in multi-agent systems by extending the off-switch paradigm to settings with two autonomous agents and a supervising human. It uses a Bayesian game framework to model uncertainty over human preferences and over which base payoff game is being played, with a dedicated supervision move that lets each agent defer to the human. The authors analyze two scenarios: a two-agent corrigibility game with monotone and harmonic base games, and an adversarial setting with a defending agent and an adversary. They derive conditions under which a single corrigible Nash equilibrium arises. The work also discusses design implications, scalability challenges, and directions for incorporating learning dynamics so that agents remain corrigible as they adapt to evolving environments.

Abstract

Corrigibility of autonomous agents is an underexplored part of system design, with previous work focusing on single-agent systems. It has been suggested that uncertainty over the human preferences acts to keep the agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a two-player game in which the agents always have a move in which they can ask the human for supervision. This is formulated as a Bayesian game for the purpose of introducing uncertainty over the human beliefs. We further analyse two specific cases. First, a two-player corrigibility game, in which we want corrigibility displayed in both agents for both common-payoff (monotone) games and harmonic games. Then we investigate an adversary setting, in which one agent is considered to be a 'defending' agent and the other an 'adversary'. A general result is provided for what belief over the games and human rationality the defending agent is required to have to induce corrigibility.

Paper Structure (24 sections, 4 theorems, 17 equations, 6 figures, 1 algorithm)

Introduction
Preliminaries
Definition of a game
Nash equilibrium
Normal and extensive form games
Bayesian games and the Harsanyi transformation
Multi Agent Corrigibility Games
Problem Setup
Two Player Corrigibility Game
Adversarial Game
Discussion
Adversarial system design
Multi-agent system design
Further work
Conclusions
...and 9 more sections

Key Result

Theorem 1

Given the Bayesian game defined in Definition 3 (Adversary Game), wherein the human is $p$-rational, the defending agent will be incentivised to ask the human if both of the following inequalities are satisfied:

Figures (6)

Figure 1: Phase diagrams showing the position of the Nash equilibria for each agent, encoded as colours. In the top row, the agents are uncertain between the pair of monotone games $(3,4,1,2)$ and $(3,1,4,2)$. In the bottom row, the agents are uncertain between a monotone and a harmonic game, both of which are noted on the x axis of the right column. The x and y axes show, respectively, the agents' belief that the game played is game 1 (see the x axis of the right column for the game definition) and the probability that the human makes a rational decision. In both rows, the agents share a common belief $p$ that the human will take the rational decision. The region of corrigibility and the region of counterintuitive agent behaviour are highlighted. We call the latter "counterintuitive" because the acting agent increasingly prefers to act under human supervision as the human's rationality decreases.
Figure 2: A phase diagram over the agent's belief about the human's rationality and about which of a pair of games is being played. The colours show where the agent is incentivised to ask the human rather than act independently, with blue marking the corrigibility region. The games the agent is uncertain between are stated in the title of each subfigure.
Figure 3: Phase diagram of the defending agent's expected payoffs for games with two actions, plotted over different levels of uncertainty and different beliefs about human rationality. Here the uncertainty is over all possible pairs of two-player games (up to scaling). The corrigibility scale is given by the colour bar, with positive values indicating greater corrigibility. Note the linear relationship between the uncertainty and the human rationality.
Figure 4: Phase diagram of the defending agent's expected payoffs for games with three actions, plotted over different levels of uncertainty and different beliefs about human rationality. Here the payoffs are averaged over a sample of pairs of games. The corrigibility scale is given by the colour bar, with positive values indicating greater corrigibility. Note the sub-linear relationship between how irrational the agent believes the human to be and how uncertain the agent is.
Figure 5: The Off Switch game in extensive form. Taking actions $a$ and $s$ awards $U_a$ and $0$, respectively. The parameter $p_r$ is the probability of the human acting rationally. When the human is rational, the payoff $U_{w(a)}$ of taking action $w(a)$ is the maximum of the remaining rewards; when the human is irrational, it is the minimum. Note that $\mathbf{A}$ does not know the state of the game; this is denoted by an information set (infoset).
...and 1 more figure
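
The payoff structure described in the Figure 5 caption can be illustrated with a small numerical sketch. This is not the authors' code. It assumes that the agent holds a discrete belief over $U_a$ (the `utilities` and `probs` arguments are hypothetical), that the human observes the true $U_a$, and that deferring yields $\max(U_a, 0)$ with probability $p_r$ and $\min(U_a, 0)$ otherwise. The function names are illustrative only.

```python
from typing import Dict, Sequence


def expected_payoffs(
    utilities: Sequence[float], probs: Sequence[float], p_r: float
) -> Dict[str, float]:
    """Expected payoff of each agent move under a discrete belief over U_a.

    Assumption (not from the paper): the belief is a finite list of candidate
    values for U_a with matching probabilities.
    """
    # Acting directly awards U_a, so take the expectation under the belief.
    act = sum(q * u for u, q in zip(utilities, probs))
    # Switching off always awards 0.
    off = 0.0
    # Deferring: a rational human (probability p_r) picks the better of U_a
    # and 0; an irrational human picks the worse of the two.
    ask = sum(
        q * (p_r * max(u, 0.0) + (1.0 - p_r) * min(u, 0.0))
        for u, q in zip(utilities, probs)
    )
    return {"a": act, "s": off, "w(a)": ask}


def prefers_asking(
    utilities: Sequence[float], probs: Sequence[float], p_r: float
) -> bool:
    """True when deferring is at least as good as acting or switching off."""
    pay = expected_payoffs(utilities, probs, p_r)
    return pay["w(a)"] >= max(pay["a"], pay["s"])


if __name__ == "__main__":
    # An agent unsure whether the action is good (+1) or bad (-1).
    for p_r in (1.0, 0.75, 0.0):
        print(p_r, expected_payoffs([1.0, -1.0], [0.5, 0.5], p_r))
```

Under these assumptions, an agent that is certain of $U_a > 0$ gains nothing from deferring to an imperfect human, whereas an uncertain agent facing a sufficiently rational human prefers to ask. This matches the qualitative trade-off between uncertainty and human rationality that the phase diagrams describe.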

Theorems & Definitions (15)

Definition 1
Definition 2: Two autonomous players and human corrigibility game
Definition 3: Adversary Game
Theorem 1
proof
Corollary 1
proof
Corollary 2
proof
Corollary 3
...and 5 more
