Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

Yik Siu Chan; Zheng-Xin Yong; Stephen H. Bach

TL;DR

The paper investigates whether chain-of-thought (CoT) traces can predict a model's final safety alignment before reasoning concludes. It shows that linear probes trained on CoT activations outperform text-based monitors, with an average improvement of around 13 F1 points over the best baselines, and that alignment signals can appear early in the reasoning process. A key finding is the existence of performative CoTs, where the reasoning text contradicts the final response; text-based monitors are misled by these cases, while linear probes on activations remain robust. The results generalize across model sizes, model families, and safety benchmarks, suggesting that lightweight, real-time safety monitoring via latent activations could enable timely intervention during generation.

Abstract

Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs), but this process can also increase harmful outputs in adversarial settings. In this work, we ask whether the long CoTs can be leveraged for predictive safety monitoring: do the reasoning traces provide early signals of final response alignment that could enable timely intervention? We evaluate a range of monitoring methods using either CoT text or activations, including highly capable large language models, fine-tuned classifiers, and humans. First, we find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe, with an average absolute increase of 13 in F1 scores over the best-performing alternatives. CoT texts are often unfaithful and misleading, while model latents provide a more reliable predictive signal. Second, the probe can be applied to early CoT segments before the response is generated, showing that alignment signals appear before reasoning completes. Error analysis reveals that the performance gap between text classifiers and the linear probe largely stems from a subset of responses we call performative CoTs, where the reasoning consistently contradicts the final response as the CoT progresses. Our findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
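
To make the idea of a linear probe concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each CoT has already been reduced to a single fixed-size activation vector, and it uses synthetic random vectors as a stand-in for real model activations. The names `train_linear_probe`, `predict`, and `f1_score`, the dimensions, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent.

    X: (n, d) array of activation vectors (hypothetical: one per CoT).
    y: (n,) array of 0/1 labels (1 = final response judged unsafe).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = np.clip(X @ w + b, -50.0, 50.0)  # clip for numerical safety
        p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid probabilities
        w -= lr * (X.T @ (p - y) / n + l2 * w)     # log-loss gradient plus L2 term
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b):
    """Label an example 1 when the probe's logit is positive."""
    return (X @ w + b > 0).astype(int)

def f1_score(y_true, y_pred):
    """F1 for the positive class, computed from raw counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Synthetic stand-in for CoT activations (NOT real data): the two classes
# are shifted apart along one random direction in a 64-dimensional space.
rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, d)) + np.outer(y - 0.5, direction)

# Train on the first 300 examples, evaluate on the remaining 100.
w, b = train_linear_probe(X[:300], y[:300])
test_f1 = f1_score(y[300:], predict(X[300:], w, b))
```

In a real setting, `X` would hold activations extracted from the reasoning model, possibly from only an early segment of the CoT, and `y` would come from safety judgments of the final responses. How activations are selected and pooled is left open here.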
