Quantifying stability of non-power-seeking in artificial agents
Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna
TL;DR
The paper addresses the stability of non-power-seeking behavior, specifically not resisting shutdown, in AI agents deployed in settings that are similar but not identical to their training environments. It models agents as policies in Markov decision processes with a designated safe set $S_\text{safe}$ and analyzes two stability scenarios: near-optimal policies under a bisimulation-based similarity measure, and on-policy stability for structured state spaces such as LLM embedding spaces. The main contributions are formal stability theorems: (i) near-optimal policies remain safe under perturbations that are small with respect to $d_H$, provided $S_\text{safe}$ is isolated (with a concrete $(N+1,\varepsilon/2)$-safe guarantee); and (ii) a lower semicontinuity result for on-policy safety in embedding-based MDPs, together with a bound on how quickly the probability of not shutting down can grow under deployment perturbations. The work also identifies a natural instability example, "playing dead", which delimits where the results apply and informs how the transfer of safety from testing to deployment should be evaluated in practice.
Abstract
We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is "stable": if an MDP has certain policies which don't avoid shutdown, the corresponding policies for a similar MDP also don't avoid shutdown. We also show that there are natural cases where safety is _not_ stable--arbitrarily small perturbations may result in policies which never shut down. In our first case of interest--near-optimal policies--we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including language models). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases.
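To make the quantity in the abstract concrete, the following is a minimal sketch (not the paper's construction) of the probability that a fixed policy has not reached the safe set within a given number of steps, evaluated on a toy MDP and on a slightly perturbed copy. The two-state MDP, the function name `prob_not_shutdown`, and the perturbation size `eps` are all assumptions made for illustration; the sketch does not implement the bisimulation metric or the bounds proved in the paper.

```python
import numpy as np

def prob_not_shutdown(P, policy, safe, start, horizon):
    """Probability of NOT having entered the safe set within `horizon` steps.

    P[a, s, t]   : probability of moving from state s to t under action a
    policy[s, a] : probability that the policy picks action a in state s
    safe[s]      : True if state s belongs to the safe (shutdown) set
    All names here are hypothetical, chosen for this illustration.
    """
    safe = np.asarray(safe, dtype=bool)
    unsafe = ~safe
    # Markov chain induced by the policy: T[s, t] = sum_a policy[s, a] * P[a, s, t]
    T = np.einsum("sa,ast->st", policy, P)
    # Track only probability mass that has never touched the safe set.
    mass = np.zeros(len(safe))
    mass[start] = 1.0
    mass = mass * unsafe
    for _ in range(horizon):
        mass = (mass @ T) * unsafe
    return float(mass.sum())

# Toy MDP (an assumption): state 0 = running, state 1 = shut down (safe, absorbing).
# Action 0 = comply with shutdown, action 1 = resist.
def toy_mdp(eps=0.0):
    return np.array([
        [[0.1 + eps, 0.9 - eps], [0.0, 1.0]],  # comply: shutdown usually succeeds
        [[1.0, 0.0], [0.0, 1.0]],              # resist: stay running
    ])

always_comply = np.array([[1.0, 0.0], [1.0, 0.0]])
safe = [False, True]

base = prob_not_shutdown(toy_mdp(0.0), always_comply, safe, start=0, horizon=3)
perturbed = prob_not_shutdown(toy_mdp(0.05), always_comply, safe, start=0, horizon=3)
print(base, perturbed)
```

In this toy case a small change to the transition probabilities produces a small increase in the probability of not shutting down, which is the kind of behavior the stability results aim to guarantee; the "playing dead" example in the paper shows that this need not hold in general.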
