The AI off-switch problem as a signalling game: bounded rationality and incomparability

Alessio Benavoli, Alessandro Facchini, Marco Zaffalon

TL;DR

This work models the AI off-switch problem as a signalling game between a human sender and an AI receiver, with a boundedly rational human and an AI that learns the human's utility from preference data while keeping track of its own uncertainty. It shows that when the human is fully rational, the AI will not disable its off-switch, whereas when the human is boundedly rational, the AI must retain some uncertainty about the human's utility, otherwise it disables the off-switch. The authors extend the theory with message costs, incomparability, and vector-valued utilities, and use Bayesian-style numerical experiments to show that even approximate uncertainty is beneficial, and indeed essential, for safe deferral behaviour. The findings underscore the practical importance of probabilistic reasoning in AI control and point to prior design and repeated-interaction models as ways to safeguard off-switch functionality in high-stakes applications.

Abstract

The off-switch problem is a critical challenge in AI control: if an AI system resists being switched off, it poses a significant risk. In this paper, we model the off-switch problem as a signalling game, where a human decision-maker communicates its preferences about some underlying decision problem to an AI agent, which then selects actions to maximise the human's utility. We assume that the human is a bounded rational agent and explore various bounded rationality mechanisms. Using real machine learning models, we reprove prior results and demonstrate that a necessary condition for an AI system to refrain from disabling its off-switch is its uncertainty about the human's utility. We also analyse how message costs influence optimal strategies and extend the analysis to scenarios involving incomparability.
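The role of uncertainty can be illustrated with a minimal sketch. It is not the paper's GP-based model: a scalar Gaussian belief is swapped in, and all names (`mu`, `s`, `sigma_h`, `expected_payoffs`, `best_action`) are hypothetical. The assumptions are that the AI's belief about the utility gap between acting and switching off is $N(\mu, s^2)$, that switching off has utility 0, and that a human asked for permission allows the action with probit probability $\Phi(\Delta/\sigma_h)$, where $\sigma_h = 0$ stands for a fully rational human.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_payoffs(mu, s, sigma_h):
    """Expected payoffs of the AI's three options (illustrative assumption).

    mu, s    : mean and standard deviation of the AI's Gaussian belief about
               the utility gap Delta between acting and switching off.
    sigma_h  : scale of the human's probit noise (0 means fully rational).
    """
    t = math.hypot(s, sigma_h)  # sqrt(s^2 + sigma_h^2)
    if t == 0.0:
        # No uncertainty and a rational human: the action is allowed iff mu > 0.
        defer = max(mu, 0.0)
    else:
        # Closed form of E[Delta * Phi(Delta / sigma_h)] for Delta ~ N(mu, s^2).
        defer = mu * norm_cdf(mu / t) + (s * s / t) * norm_pdf(mu / t)
    # "act" bypasses the human (the off-switch is disabled); "off" yields 0.
    return {"defer": defer, "act": mu, "off": 0.0}


def best_action(mu, s, sigma_h):
    """Option with the highest expected payoff; ties are broken towards deferring."""
    payoffs = expected_payoffs(mu, s, sigma_h)
    return max(payoffs, key=payoffs.get)
```

Under these assumptions, `best_action(0.5, 1.0, 0.0)` defers to a rational human, `best_action(0.5, 0.0, 1.0)` bypasses a noisy human once the AI's uncertainty vanishes, and `best_action(0.5, 3.0, 1.0)` defers again when enough uncertainty remains.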

Paper Structure

This paper contains 22 sections, 11 theorems, 65 equations, 5 figures.

Key Result

Lemma 4.1

Assume that $p(\nu|\mathcal{D})=GP\left(\nu;\mu_p ,K_p\right)$ is the GP posterior computed by $R$ from the prior $p(\nu)=GP\left(\nu;\mu_0,K_0\right)$, the bounded-rationality likelihood (eq:probit), and the message $m_j=\mathcal{D}$. Then the expected payoffs of $R$'s actions are: [...] where [...] with $p(n(x),n(o))=N(n(x);0,\sigma^2)N(n(o);0,\sigma^2)$ and [...]
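The noise density in the lemma factorises into two independent zero-mean Gaussians with variance $\sigma^2$. One hedged reading, assuming that this noise is added to the utilities of $x$ and $o$ before $S$ compares them (an assumption, since the likelihood itself is not shown here), is that $S$ reports $x$ over $o$ with probability $\Phi\left((\nu(x)-\nu(o))/(\sqrt{2}\sigma)\right)$. The sketch below checks this identity by simulation; the function names are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_preference(nu_x, nu_o, sigma):
    """P(x preferred to o) under independent N(0, sigma^2) noise on each utility.

    The difference of the two noise terms is N(0, 2 * sigma^2), hence the
    sqrt(2) * sigma scale inside the probit link.
    """
    return norm_cdf((nu_x - nu_o) / (math.sqrt(2.0) * sigma))


def simulate_preference(nu_x, nu_o, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same probability (assumed noise model)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        n_x = rng.gauss(0.0, sigma)  # noise on the utility of x
        n_o = rng.gauss(0.0, sigma)  # noise on the utility of o
        if nu_x + n_x > nu_o + n_o:
            wins += 1
    return wins / n_samples
```

Comparing `simulate_preference(1.0, 0.0, 1.0)` with `probit_preference(1.0, 0.0, 1.0)` should give values that agree up to Monte Carlo error.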

Figures (5)

  • Figure 1: $S$'s utility for risotto.
  • Figure 2: GP prior: mean function (black line), 95% credible region (blue shaded area), and 10 samples of $\nu(x)$, each shown in a different colour.
  • Figure 3: GP posterior: mean function (black line), 95% credible region (blue shaded area), and 10 samples of $\nu(x)$, each shown in a different colour.
  • Figure 4: GP posterior: mean function (black line), 95% credible region (blue shaded area), and 10 samples of $\nu(x)$, each shown in a different colour.
  • Figure 5: Percentage of decisions for the four approximations, with MAP denoted as NN.

Theorems & Definitions (17)

  • Example 2.1
  • Example 2.2
  • Definition 3.1
  • Lemma 4.1
  • Definition 4.1
  • Proposition 4.1
  • Corollary 4.1
  • Lemma 4.2
  • Proposition 4.2
  • Proposition 4.3
  • ...and 7 more