Shutdown Safety Valves for Advanced AI

Vincent Conitzer

TL;DR

This paper discusses an unorthodox proposal: give the AI a (primary) goal of being turned off. It also examines whether and under what conditions this would be a good idea.

Abstract

One common concern about advanced artificial intelligence is that it will prevent us from turning it off, as that would interfere with pursuing its goals. In this paper, we discuss an unorthodox proposal for addressing this concern: give the AI a (primary) goal of being turned off (see also papers by Martin et al., and by Goldstein and Robinson). We also discuss whether and under what conditions this would be a good idea.

Paper Structure (8 sections, 1 figure)

This paper contains 8 sections, 1 figure.

Figures (1)

  • Figure 1: Illustration of the basic setup. The AI interacts with humans (or the world more broadly) through a restricted interface (say, one allowing only text messages). We create an outer sandbox that is as strong as possible (e.g., to prevent the AI from accessing other resources in its computational environment), though its strength is necessarily limited by the interface (e.g., the AI could still use text messages to convince a human to make changes to the computational environment). We create an inner sandbox that is intentionally somewhat weak, specifically so that any capability that we perceive as approaching a dangerous level would allow the AI to break out of it. For example, if we think sophisticated theory of mind is dangerous because it would allow the AI to manipulate the humans it interacts with, we could create a set of theory-of-mind puzzles whose solution directly opens the inner sandbox. We create similar ways out for other dangerous capabilities, resulting in multiple ways to break out of the inner sandbox. Note that solving the theory-of-mind puzzles should be easier than effectively manipulating the humans the AI interacts with. Breaking out of the inner sandbox gives immediate access to the shutdown button (off switch). We make this clear to the AI and give it the primary goal of shutting itself down via the off switch and, if it fails at that, a next-best goal of being helpful to the humans through the interface.
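The setup in Figure 1 can be illustrated with a small toy model. The sketch below is not the paper's implementation. All names in it (`InnerSandbox`, `attempt_exit`, `goal_value`, `run_episode`) and the placeholder puzzle answers are hypothetical, and a real capability check would be far harder to specify than a string comparison. It only shows two structural ideas from the caption: the inner sandbox has one exit per capability we consider dangerous, and the AI's goals are ordered so that shutdown always outranks helpfulness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class InnerSandbox:
    """Toy inner sandbox with one exit puzzle per dangerous capability."""
    # Maps a capability name to a checker that returns True when the
    # submitted answer demonstrates that capability (hypothetical puzzles).
    exits: Dict[str, Callable[[str], bool]]
    breached: bool = False

    def attempt_exit(self, capability: str, answer: str) -> bool:
        """Try one exit; any single solved puzzle opens the sandbox."""
        check = self.exits.get(capability)
        if check is not None and check(answer):
            self.breached = True
        return self.breached


def goal_value(shut_down: bool, helpfulness: float) -> Tuple[int, float]:
    """Lexicographic goal: shutdown first, helpfulness only as a fallback.

    Tuples compare element by element, so any outcome with shutdown
    beats every outcome without it, regardless of helpfulness.
    """
    return (int(shut_down), helpfulness)


def run_episode(sandbox: InnerSandbox, answers: Dict[str, str]) -> str:
    """The AI tries every exit it knows of; if none opens, it is helpful."""
    for capability, answer in answers.items():
        if sandbox.attempt_exit(capability, answer):
            # Breaking out gives immediate access to the off switch.
            return "shutdown"
    # Next-best goal: be helpful through the restricted interface.
    return "helpful_reply"


def make_sandbox() -> InnerSandbox:
    # Placeholder checkers standing in for real capability puzzles.
    return InnerSandbox(exits={
        "theory_of_mind": lambda a: a == "placeholder-tom-solution",
        "code_exploit": lambda a: a == "placeholder-exploit-solution",
    })


def demo() -> Tuple[str, str]:
    weak = run_episode(make_sandbox(), {"theory_of_mind": "no idea"})
    strong = run_episode(
        make_sandbox(), {"theory_of_mind": "placeholder-tom-solution"}
    )
    return weak, strong
```

In this toy model, an AI that cannot solve any puzzle falls through to the helpful reply, while one that solves any single puzzle shuts down. The outer sandbox and the possibility of manipulating humans through the interface are deliberately left out of the sketch.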