Evaluating Stability of Unreflective Alignment
James Lucassen, Mark Henry, Philippa Wright, Owen Yeung
TL;DR
This paper investigates whether reflective stability problems will need to be solved to align future LLMs. It introduces the Counterfactual Priority Change (CPC) destabilization threat model, in which CPC-based stepping back combines with preference instability to produce a potential failure mode on long-horizon tasks. It operationalizes this threat through three experimental pillars: CPC-curves, which relate CPC reasoning to actual strategy switching; a Multi-Armed Bandit toy task, which probes dynamic planning; and preference cycles over Dominion card prioritization, which gauge the stability of high-level goals. Across GPT-3.5-turbo and GPT-4 family models, the findings suggest that more capable models exhibit both more CPC-based stepping back and more preference instability, though the results are preliminary and sensitive to prompts and task design. The work highlights scaling trends that could complicate safe delegation of cognitive labor, and outlines concrete avenues for refining metrics, expanding task domains, and developing alignment approaches that remain robust to reflective instability.
Abstract
Many theoretical obstacles to AI alignment are consequences of reflective stability, the problem of designing alignment mechanisms that the AI would not disable if given the option. However, problems stemming from reflective stability are not obviously present in current LLMs, leading to disagreement over whether they will need to be solved to enable safe delegation of cognitive labor. In this paper, we propose Counterfactual Priority Change (CPC) destabilization as a mechanism by which reflective stability problems may arise in future LLMs. We describe two risk factors for CPC-destabilization: 1) CPC-based stepping back and 2) preference instability. We develop preliminary evaluations for each of these risk factors and apply them to frontier LLMs. Our findings indicate that in current LLMs, increased scale and capability are associated with increases in both CPC-based stepping back and preference instability, suggesting that CPC-destabilization may cause reflective stability problems in future LLMs.
