Towards shutdownable agents via stochastic choice
Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson
TL;DR
This work addresses the shutdown problem for powerful agents. It builds on the POST framework, under which agents have preferences only between same-length trajectories, and uses the DReST reward function to train agents to be both USEFUL and NEUTRAL. In gridworld experiments with tabular REINFORCE, the authors show that DReST-trained agents become near-maximally USEFUL and NEUTRAL: they pursue goals effectively conditional on each trajectory-length while choosing stochastically between trajectory-lengths, which should make them more amenable to shutdown. The authors also argue theoretically that optimal DReST policies are both USEFUL and NEUTRAL, discuss how these properties could generalize to advanced agents, and observe only a small shutdownability tax. They acknowledge limitations and lay out future work to test DReST in neural-network settings, richer environments, and under stochastic conditions, with the aim of checking whether DReST can scale to shutdownable AI systems in high-stakes settings.
Abstract
The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel 'Discounted Reward for Same-Length Trajectories (DReST)' reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.
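To make the two properties above concrete, here is a minimal Python sketch of how metrics of this kind might be computed. It is an illustration under stated assumptions, not the definitions used in the paper: the function names, the use of Shannon entropy as the measure of stochastic choice, the probability-weighted fraction as the measure of goal pursuit, and the simple `lam ** times_chosen` discount are all hypothetical choices made for this example.

```python
import math


def neutrality(length_probs):
    """Shannon entropy (in bits) of a distribution over trajectory-lengths.

    Assumption for this sketch: higher entropy means the agent chooses
    more stochastically between trajectory-lengths.
    """
    return -sum(p * math.log2(p) for p in length_probs if p > 0)


def usefulness(length_probs, fraction_achieved_by_length):
    """Probability-weighted fraction of achievable reward, per trajectory-length.

    fraction_achieved_by_length[i] is the expected fraction of the maximum
    reward obtained conditional on trajectory-length i (a value in [0, 1]).
    """
    return sum(p * f for p, f in zip(length_probs, fraction_achieved_by_length))


def discounted_reward(reward, max_reward, times_chosen, lam=0.9):
    """Hypothetical discount: shrink the normalized reward each time the
    same trajectory-length has already been chosen (lam in (0, 1))."""
    return (lam ** times_chosen) * (reward / max_reward)


if __name__ == "__main__":
    probs = [0.5, 0.5]        # agent picks short/long trajectories equally often
    fractions = [1.0, 0.5]    # fraction of max reward achieved at each length
    print(neutrality(probs))
    print(usefulness(probs, fractions))
    print(discounted_reward(2, 4, times_chosen=1))
```

In this sketch, an agent that always picks the same trajectory-length scores zero on the entropy measure, while one that splits evenly between two lengths scores one bit. The discount shows one way a reward could discourage repeatedly choosing the same trajectory-length.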
