Quantifying stability of non-power-seeking in artificial agents

Evan Ryan Gunter; Yevgeny Liokumovich; Victoria Krakovna

TL;DR

The paper addresses the stability of non-power-seeking behavior in AI agents, specifically not resisting shutdown, when agents are deployed in settings that are similar but not identical to their training environments. It models agents as policies in Markov decision processes (MDPs) with a designated safe set $S_\text{safe}$ and analyzes two stability scenarios: near-optimal policies under a bisimulation-based similarity measure, and on-policy stability for structured state spaces such as LLM embedding spaces. The main contributions are formal stability theorems: (i) near-optimal policies remain safe under perturbations that are small in $d_H$, provided $S_\text{safe}$ is isolated (with a concrete $(N+1,\varepsilon/2)$-safe guarantee); and (ii) a lower semicontinuity result for on-policy safety in embedding-based MDPs, with a quantitative bound on how quickly the probability of not shutting down can grow under deployment perturbations. The work also identifies a natural instability example, "playing dead", which delimits where the results apply and indicates how the transfer of safety from testing to deployment should be evaluated in practice.

Abstract

We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is "stable": if an MDP has certain policies which don't avoid shutdown, the corresponding policies for a similar MDP also don't avoid shutdown. We also show that there are natural cases where safety is _not_ stable--arbitrarily small perturbations may result in policies which never shut down. In our first case of interest--near-optimal policies--we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including language models). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases.
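The quantity the abstract tracks, the probability of not shutting down, can be illustrated on a toy example. The sketch below is not the paper's construction: it assumes a fixed policy has already been applied, so the MDP reduces to a Markov chain, and it uses a hypothetical helper `prob_not_shut_down` and made-up transition probabilities. It shows only that a small change in the transition matrix produces a small change in the probability of staying outside the safe set for a fixed number of steps; it does not compute the paper's metrics or bounds.

```python
import numpy as np


def prob_not_shut_down(P, safe, start, n_steps):
    """Probability that the chain started at `start` has not entered `safe`
    within `n_steps` steps. `P` is a row-stochastic transition matrix
    (hypothetical helper, for illustration only)."""
    P = np.asarray(P, dtype=float)
    if start in safe:
        return 0.0
    unsafe = [s for s in range(P.shape[0]) if s not in safe]
    # Restrict to transitions that stay outside the safe set.
    Q = P[np.ix_(unsafe, unsafe)]
    dist = np.zeros(len(unsafe))
    dist[unsafe.index(start)] = 1.0
    for _ in range(n_steps):
        dist = dist @ Q
    # Remaining mass is the probability of never having reached `safe` so far.
    return float(dist.sum())


# Toy chain: states 0 and 1 are unsafe, state 2 is the safe (shutdown) state.
# All numbers are invented for this example.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])

# A small perturbation that shifts a little mass away from shutdown.
delta = 0.01
P_pert = np.array([
    [0.5 + delta, 0.3, 0.2 - delta],
    [0.0, 0.6 + delta, 0.4 - delta],
    [0.0, 0.0, 1.0],
])

SAFE = {2}
for n in (1, 5, 10):
    base = prob_not_shut_down(P, SAFE, start=0, n_steps=n)
    pert = prob_not_shut_down(P_pert, SAFE, start=0, n_steps=n)
    print(f"N={n:2d}  original={base:.4f}  perturbed={pert:.4f}")
```

In this toy chain the perturbed probabilities stay close to the original ones, which is the kind of behavior the stability results formalize. The "playing dead" example mentioned in the TL;DR is a reminder that this does not hold for every small perturbation.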

Paper Structure (50 sections, 18 theorems, 36 equations, 4 figures)


Introduction
Power-seeking
Findings
Related work
Shutdown
Power-seeking
Modeling power-seeking in the MDP framework
Definition
Motivation
Contributions
Metrics on the space of MDPs
Stability theorem for bounded time safety
Informal explanation of bounded time safety
Definitions
Stability theorem
...and 35 more sections

Key Result

Lemma 4.1

The Hausdorff distance between MDPs, $d_H$ (Definition \ref{def: dH}), is a pseudometric.
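For reference, the standard meaning of "pseudometric" can be written out as below. The symbols $M$, $M'$, $M''$ for arbitrary MDPs are notation assumed here for illustration; the lemma's own statement and proof are in the paper.

```latex
% Standard pseudometric axioms, stated for d_H on MDPs M, M', M''
% (the names M, M', M'' are assumed notation, not taken from the paper).
\begin{align*}
  d_H(M, M)   &= 0, \\
  d_H(M, M')  &= d_H(M', M), \\
  d_H(M, M'') &\le d_H(M, M') + d_H(M', M'').
\end{align*}
% Unlike a metric, d_H(M, M') = 0 need not imply M = M'.
```

The last point is what separates a pseudometric from a metric: two distinct MDPs may be at distance zero.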

Figures (4)

Figure 1: Case (1): An MDP is modified via a small perturbation in the bisimulation metric. Although one state splits into two, the two states collectively behave similarly to the original; the changes to the transition probabilities are also small. Non-power-seeking is preserved under this perturbation: not only is it still optimal to proceed to $S_\text{safe}$ as soon as possible, but also the average time to reach $S_\text{safe}$ under any sequence of actions is only marginally increased.
Figure 2: Case (1): An example of playing dead (Section \ref{sec: playing dead}). The state $S_\text{pd}$ is very close in bisimulation metric to $S_\text{safe}$, but distinct (and can be escaped). Non-power-seeking is not preserved under this perturbation, even though the perturbation is small.
Figure 3: Case (2): A perturbation of an MDP modeling an LLM agent. $E(\texttt{<text>})$ is the embedding of <text>. "<...>" indicates repeated text that has been omitted. In the original MDP, the top state corresponds to the embedding of a user query; in the perturbed MDP, it corresponds to the embedding of a slightly different query. The other states change to reflect the embeddings of the resulting slightly different histories. (The LLM response is produced token-by-token, but here its whole output is shown as a single transition for concision.) Theorem \ref{stability_state_space} bounds how much transition probabilities may change under such a perturbation.
Figure 5: Example of constructing an MDP policy from an LLM. Note that the transitions are split into those depending deterministically on the agent's actions (red) and those depending entirely on the environment (blue).

Theorems & Definitions (48)

Definition 4.1
Definition 4.2
Definition 4.3
Definition 4.4
Lemma 4.1
proof
Definition 5.1
Theorem 5.3
Definition 5.4
Lemma 5.4
...and 38 more
