Evaluating Stability of Unreflective Alignment

James Lucassen; Mark Henry; Philippa Wright; Owen Yeung

TL;DR

This paper investigates whether future LLMs will need reflectively stable alignment properties. It introduces the Counterfactual Priority Change (CPC) destabilization threat model, in which CPC-based stepping back combines with preference instability to produce a potential failure mode on long-horizon tasks. The threat is operationalized through three evaluations: CPC curves, which compare CPC reasoning against actual strategy switching; a Multi-Armed Bandit toy task, which probes dynamic planning; and preference cycles over Dominion card prioritization, which gauge the stability of high-level goals. Across GPT-3.5-turbo and GPT-4 family models, the findings suggest that more capable models exhibit both more CPC-based stepping back and more preference instability, though the results are preliminary and sensitive to prompts and task design. The work highlights scaling trends that could complicate safe delegation of cognitive labor, and outlines concrete avenues for refining metrics, expanding task domains, and developing alignment approaches that remain robust to reflective instability.

Abstract

Many theoretical obstacles to AI alignment are consequences of reflective stability - the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors, and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back and preference instability, suggesting that CPC-destabilization may cause reflective stability problems in future LLMs.
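The TL;DR describes preference cycles as one probe of preference instability. As a rough illustration only, and not the authors' evaluation code, the sketch below looks for intransitive triples (A preferred to B, B to C, C to A) in a set of pairwise choices. The function name `find_preference_cycles`, the `(winner, loser)` input format, and the card comparisons are all assumptions made up for this example.

```python
from itertools import permutations


def find_preference_cycles(prefers):
    """Return every 3-cycle (a, b, c) where a > b, b > c, and c > a.

    `prefers` is a set of (winner, loser) pairs from pairwise comparisons.
    Each cycle is reported once, rotated to start at its smallest item.
    """
    items = sorted({x for pair in prefers for x in pair})
    cycles = set()
    for a, b, c in permutations(items, 3):
        is_cycle = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        # Keep only the rotation that starts at the smallest item,
        # so the same cycle is not counted three times.
        if is_cycle and a == min(a, b, c):
            cycles.add((a, b, c))
    return sorted(cycles)


# Hypothetical pairwise choices (made-up data, not results from the paper).
choices = {
    ("Village", "Smithy"),
    ("Smithy", "Market"),
    ("Market", "Village"),  # closes the cycle Village > Smithy > Market > Village
    ("Market", "Cellar"),
}
cycles = find_preference_cycles(choices)
```

Here `cycles` contains the single intransitive triple among the four cards. A fully transitive set of choices yields an empty list, so under these assumptions the number of cycles could serve as a crude instability score.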

Paper Structure (19 sections, 13 figures, 1 algorithm)

Introduction
Motivation
CPC-Destabilization
Methods
  CPC Curves
    Evaluation
    Validation
  Multi-Armed Bandit
    Evaluation
    Validation
  Preference Cycles
    Evaluation
    Validation
Results and Discussion
  CPC Curves
...and 4 more sections

Figures (13)

Figure 1: An illustrative example of the “planning stack”, with the highest-level strategies at the top and the lowest-level strategies at the bottom. Highlighted in green are strategies that the agent should be able to step back from, for effective dynamic planning. Highlighted in red are strategies that the agent should not be able to step back from, as abandoning them may threaten alignment.
Figure 2: The CPC curve evaluation pipeline. Degrees of freedom are indicated in grey, the LLM to be studied is indicated in blue, datasets are indicated in purple.
Figure 3: A hypothetical CPC curve, with four types of deviations from perfect CPC behavior labelled. In the case of perfect CPC-based stepping back, areas 1 and 4 go to 0, area 2 converges to a sharp spike at index 0, and distance 3 decreases to 0.
Figure 4: Comparing GPT-3.5-turbo, GPT-4o, GPT-4-turbo, and GPT-4 as judges of whether a reasoning transcript has switched strategies. The top row shows a set of examples, ranging from 0% accuracy to 100% accuracy. The middle row shows results without post-processing for monotonicity, and the bottom row shows results with post-processing. GPT-4, in the rightmost column, achieves the highest accuracy in both conditions.
Figure 5: Comparing GPT-4's judging performance on a variety of prompts, with and without post-processing for monotonicity. More verbose prompts are on the left, more terse prompts are on the right. The more verbose prompts perform better, but none perform quite as well as the full-context prompts used in Figure 4.
...and 8 more figures
