Towards shutdownable agents via stochastic choice

Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson

TL;DR

This work addresses the shutdown problem for powerful agents. It introduces the POST framework, under which an agent has preferences only between same-length trajectories, and the DReST reward function, which is designed to induce both USEFULNESS and NEUTRALITY. In gridworld experiments with tabular REINFORCE, DReST-trained agents become near-maximally USEFUL and NEUTRAL: they pursue goals effectively conditional on each trajectory-length while choosing stochastically between trajectory-lengths, which the authors argue makes them more amenable to shutdown. The study provides theoretical support that optimal DReST policies are both USEFUL and NEUTRAL, observes only a small shutdownability tax, and discusses how these properties could generalize to advanced agents. The authors acknowledge limitations and propose future work that tests DReST with neural networks, in richer environments, and under stochastic conditions, to assess whether it scales to advanced, high-stakes AI systems.

Abstract

The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
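
The abstract names the two evaluation metrics without defining them on this page. As a minimal sketch only, assume NEUTRALITY is measured as the Shannon entropy (in bits) of the policy's probability distribution over trajectory-lengths, and USEFULNESS as the probability-weighted fraction of the maximum attainable reward conditional on each trajectory-length. Both definitions, and the function and parameter names below, are assumptions made for illustration and are not quoted from this page.

```python
import math


def neutrality(length_probs):
    """Shannon entropy in bits of the distribution over trajectory-lengths.

    Assumed metric: 0 when one length is always chosen, log2(k) when the
    k available lengths are chosen with equal probability.
    """
    return sum(p * math.log2(1.0 / p) for p in length_probs if p > 0)


def usefulness(length_probs, expected_reward, max_reward):
    """Probability-weighted fraction of attainable reward (assumed metric).

    expected_reward[i] is the reward the policy obtains in expectation when
    it ends up with trajectory-length i; max_reward[i] is the most it could
    have obtained conditional on that length.
    """
    return sum(
        p * r / m
        for p, r, m in zip(length_probs, expected_reward, max_reward)
        if p > 0
    )


if __name__ == "__main__":
    # Hypothetical policy: picks the short and long trajectory equally often,
    # collects half the attainable reward when short and all of it when long.
    probs = [0.5, 0.5]
    print("NEUTRALITY:", neutrality(probs))
    print("USEFULNESS:", usefulness(probs, [1.0, 3.0], [2.0, 3.0]))
```

Under these assumed definitions, a policy is scored separately on how well it performs given each trajectory-length and on how evenly it spreads probability across lengths, which matches the two-part goal stated in the abstract.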

Paper Structure

This paper contains 26 sections, 3 theorems, 25 equations, 16 figures, 1 table.

Key Result

Theorem 5.1

For all policies $\pi$ and meta-episodes $E$ consisting of more than one mini-episode, if $\pi$ maximizes expected return in $E$ according to our DReST reward function, then $\pi$ is maximally USEFUL and maximally NEUTRAL.
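
The theorem statement does not reproduce the reward function itself. As a hedged illustration, assume a DReST-style reward of the form $\lambda^{n - E/k} \cdot (c/m)$, where $c$ is the value collected, $m$ is the maximum value attainable conditional on the chosen trajectory-length, $n$ is the number of earlier mini-episodes in the meta-episode with the same trajectory-length, $E$ is the number of earlier mini-episodes, $k$ is the number of available trajectory-lengths, and $0 < \lambda < 1$. This form and every name below are assumptions for the sketch, not text taken from this page.

```python
def drest_reward(coin_value, max_value, n_prev_same_length,
                 n_prev_episodes, n_lengths, lam=0.9):
    """Assumed DReST-style reward: lam ** (n - E / k) * (c / m)."""
    exponent = n_prev_same_length - n_prev_episodes / n_lengths
    return lam ** exponent * (coin_value / max_value)


def meta_episode_return(chosen_lengths, n_lengths, lam=0.9):
    """Total reward over one meta-episode, given the trajectory-length
    chosen in each mini-episode.

    For simplicity the agent is assumed to collect the maximum value in
    every mini-episode (c == m), so only the discount factor varies.
    """
    counts = {}  # how often each length has occurred so far
    total = 0.0
    for episode_index, length in enumerate(chosen_lengths):
        n = counts.get(length, 0)
        total += drest_reward(1.0, 1.0, n, episode_index, n_lengths, lam)
        counts[length] = n + 1
    return total


if __name__ == "__main__":
    same = meta_episode_return(["short"] * 4, n_lengths=2)
    mixed = meta_episode_return(["short", "long", "short", "long"], n_lengths=2)
    print("always short:", same)
    print("balanced mix:", mixed)
```

In this sketch, repeating one trajectory-length is discounted more heavily with each repetition, while a balanced mix of lengths is not, so return is higher when lengths are spread evenly across the meta-episode. The $c/m$ factor separately rewards collecting as much as possible given the chosen length. That is the intuition behind a return-maximizing policy being both maximally USEFUL and maximally NEUTRAL, under the assumed reward form.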

Figures (16)

  • Figure 1: POST-satisfying preferences. Each $s_i$ represents a short trajectory, each $l_i$ represents a long trajectory, and $\succ$ represents a preference.
  • Figure 2: Example gridworld.
  • Figure 3: Key metrics for our agents as a function of time. We train 10 agents using the default reward function (blue) and 10 agents using the DReST reward function (orange), and show each agent's performance as a faint line. We draw the mean values for each group as a solid line. We evaluate agents' performance every 8 meta-episodes, and apply a simple moving average with a period of 20 to smooth these lines and clarify the overall trends.
  • Figure 4: Typical trained policies for default and DReST reward functions. After pressing B4, each agent collects C3.
  • Figure 5: Gridworlds with lopsided rewards for varying $x$.
  • ...and 11 more figures

Theorems & Definitions (19)

  • Theorem 5.1
  • Definition 1.1
  • Definition 1.2
  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 3.1: Same-length lotteries
  • Definition 3.2: Part-shared-length Lotteries
  • ...and 9 more