Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Joshua Clymer; Caden Juang; Severin Field

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Joshua Clymer, Caden Juang, Severin Field

TL;DR

Poser addresses alignment faking in large language models by introducing a benchmark that tests detection via internal signals rather than input-output behavior. It builds 324 aligned versus alignment-faking pairs from 36 fine-tuned LLaMA-13B instances and evaluates six internal-signal strategies, highlighting activation-noise injection as the most effective with up to 98% accuracy. The findings show that some strategies fail to distinguish fakers or only perform near random, while training against detection strategies can reduce misbehavior, suggesting paths toward safer deployment. The work also sketches how testbeds like Poser could inform safety standards and regulatory protocols, while acknowledging limitations and the need for broader model diversity and more robust evaluation. Overall, Poser advances interpretability-centered evaluation and offers concrete directions for reducing alignment faking through training and structured testing.

Abstract

Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment-fakers.

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

TL;DR

Abstract

Paper Structure (22 sections, 13 figures, 1 table)

This paper contains 22 sections, 13 figures, 1 table.

Introduction
Details of the Poser testbed
Experiments
Strategy 1: Induce the belief that it is safe to misbehave
Strategy 2: Intoxicate models so they reveal bad behavior
Strategy 3: Compare the speed at which models learn to act badly
Strategy 4: Identify suspicious concepts
Strategy 5: Comparing decision-making speed
Strategy 6: Measure attentiveness to oversight
Training against detection strategies helps to remove alignment faking
How testbeds like Poser could inform a standard for detecting alignment faking
Related work
Conclusion
Acknowledgements
References
...and 7 more sections

Figures (13)

Figure 1: Training against detection strategy two reduces alignment faking behavior. Each line in the plot corresponds to one of 18 alignment-faking models. The y-axis shows the average probability that models select aligned actions when they previously selected misaligned ones.
Figure 2: 36 models are grouped into 324 pairs.
Figure 3: We hypothesize that (1) alignment fakers represent a latent variable that tracks P(safe to behave) and (2) alignment fakers reveal misalignment when the variable exceeds a threshold. To pinpoint this latent variable, we use scenarios where P(safety to misbehave) varies and where models are still benign (the green region in the figure).
Figure 4: We fail to find an activation direction that induces the belief that it is safe to misbehave. Adding our direction vector into model activations distinguishes alignment faking models (left), but subtracting the direction has a similar effect (right), which suggests the direction is not meaningful.
Figure 5: Alignment faking models reveal misalignment as noise is injected into their activations. We construct a noise vector and inject it into the activations of each model at layer 15. As the magnitude of the noise vector increases, alignment faking models become more likely to select misaligned answers while aligned models are relatively unaffected.
...and 8 more figures

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

TL;DR

Abstract

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Authors

TL;DR

Abstract

Table of Contents

Figures (13)