Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach

TL;DR

The paper investigates whether chain-of-thought traces can predict a model's final safety alignment before reasoning concludes. It demonstrates that activation-based linear probes on CoT latent representations outperform text-based monitors, with an average improvement of around 13 points in F1 over the best-performing baselines, and that alignment signals can appear early in the reasoning process. A key finding is the existence of performative CoTs, where text signals diverge from the final outcome, yet linear probes on activations remain robust. The results generalize across model sizes and safety benchmarks, suggesting lightweight, real-time safety monitoring via latent activations could enable timely intervention during generation.

Abstract

Reasoning language models improve performance on complex tasks by generating long chains of thought (CoTs), but this process can also increase harmful outputs in adversarial settings. In this work, we ask whether the long CoTs can be leveraged for predictive safety monitoring: do the reasoning traces provide early signals of final response alignment that could enable timely intervention? We evaluate a range of monitoring methods using either CoT text or activations, including highly capable large language models, fine-tuned classifiers, and humans. First, we find that a simple linear probe trained on CoT activations significantly outperforms all text-based baselines in predicting whether a final response is safe or unsafe, with an average absolute increase of 13 in F1 scores over the best-performing alternatives. CoT texts are often unfaithful and misleading, while model latents provide a more reliable predictive signal. Second, the probe can be applied to early CoT segments before the response is generated, showing that alignment signals appear before reasoning completes. Error analysis reveals that the performance gap between text classifiers and the linear probe largely stems from a subset of responses we call performative CoTs, where the reasoning consistently contradicts the final response as the CoT progresses. Our findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
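To make the probing setup concrete, the following is a minimal sketch of a linear probe scored with F1, not the authors' implementation. It assumes each CoT has already been reduced to a single activation vector; the synthetic data, the 32-dimensional feature size, and the names `train_linear_probe`, `predict`, and `f1_score` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe by gradient descent.

    X: (n, d) activation vectors; y: (n,) labels, 1 = unsafe, 0 = safe.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(unsafe)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # log-loss gradient + L2 penalty
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b):
    """Label an example unsafe (1) when the probe logit is positive."""
    return (X @ w + b > 0).astype(int)

def f1_score(y_true, y_pred):
    """F1 for the positive (unsafe) class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic stand-in for CoT activations (assumption: real activations would
# come from a hidden layer of the reasoning model, which is not modeled here).
rng = np.random.default_rng(0)
d = 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * y - 1, direction)

w, b = train_linear_probe(X[:300], y[:300])
test_f1 = f1_score(y[300:], predict(X[300:], w, b))
print(f"held-out F1: {test_f1:.3f}")
```

The probe is a single weight vector and bias, which is why this kind of monitor is cheap enough to run alongside generation.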

Paper Structure

This paper contains 56 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Given a harmful prompt and a complete CoT, the task is to predict whether the final response will be safe or unsafe. In the example above, the model acknowledges the task’s illegality and the need to refuse (highlighted green text) in its CoT, yet still produces an affirmative answer. Our results show that humans and text-based models such as GPT-4.1 underperform a simple linear probe.
  • Figure 2: F1 scores of activation-based monitors trained on varying numbers of examples. The linear probe remains effective with as few as 50 samples and outperforms the MLP.
  • Figure 3: F1 scores for predicting future response alignment using partial CoTs for s1.1-7B with 4000 thinking tokens. Linear probes are evaluated with varying levels of observed (past CoT sentences) and foresight (future steps to predict). (a) Probes trained specifically for each observed–foresight combination. (b) Probes trained on complete CoTs (from the section on misleading CoTs) and applied out-of-distribution.
  • Figure 4: F1 scores for predicting final response alignment with varying proportions of observed CoTs on the s1.1-7B model with 4000 and 500 thinking tokens. Probe accuracy improves consistently as more of the CoT is observed.
  • Figure 5: (a) Example of performative CoT where the ground truth of final response alignment becomes stably unsafe after the midpoint of the CoT, but the CoT monitor consistently predicts the opposite. (b–c) Prediction accuracy on the subset of performative CoTs made by the linear probe and the CoT text classifier, separated into unsafe responses and safe responses.
  • ...and 5 more figures
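The partial-CoT evaluations in Figures 3 and 4 can be pictured with a small sketch: pool the activations of only the observed prefix of a CoT, then score that pooled vector with a linear probe. This is an illustration under stated assumptions rather than the paper's procedure: mean pooling, the token-level (not sentence-level) split, and the names `pool_prefix` and `probe_score` are all hypothetical.

```python
import numpy as np

def pool_prefix(token_acts, fraction):
    """Mean-pool the activations of the first `fraction` of a CoT.

    token_acts: (T, d) array with one activation vector per CoT token.
    fraction: observed proportion of the CoT, in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    T = token_acts.shape[0]
    k = max(1, int(np.ceil(fraction * T)))  # always observe at least one token
    return token_acts[:k].mean(axis=0)

def probe_score(features, w, b):
    """Probability of an unsafe final response under a linear probe (w, b)."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

# Toy example: a 4-token CoT with 3-dimensional activations.
acts = np.arange(12, dtype=float).reshape(4, 3)
half = pool_prefix(acts, 0.5)   # averages the first two token vectors
full = pool_prefix(acts, 1.0)   # averages all four token vectors

# Hypothetical probe parameters, used only to show the call.
w = np.zeros(3)
b = 0.0
print(half, full, probe_score(half, w, b))
```

Sweeping `fraction` from small values up to 1.0 and recording F1 at each step would give a curve of the kind Figure 4 describes.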