Monitoring Emergent Reward Hacking During Generation via Internal Activations

Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

TL;DR

This work proposes an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. The results suggest that internal activation monitoring provides a signal of emergent misalignment that is complementary to, and available earlier than, output-based evaluation.

Abstract

Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a signal of emergent misalignment that is complementary to, and available earlier than, output-based evaluation, supporting more robust post-deployment safety monitoring for fine-tuned language models.
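
As a rough illustration of the monitoring idea described in the abstract, the sketch below scores each generated token with a linear probe applied to sparse-autoencoder features, then summarizes the whole generation. It is a minimal sketch under stated assumptions, not the authors' implementation: the ReLU encoder form, the function names (`sae_encode`, `token_hack_probability`, `monitor_generation`), the mean aggregation, and all weights are hypothetical stand-ins chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sae_encode(x: Vector, w_enc: Sequence[Vector], b_enc: Vector) -> List[float]:
    """Encode one residual-stream vector as ReLU(w_enc @ x + b_enc).

    Assumption: a plain ReLU encoder, a common sparse-autoencoder form.
    The source does not specify the SAE architecture.
    """
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def token_hack_probability(features: Vector, w: Vector, bias: float) -> float:
    """Linear (logistic) probe on SAE features: sigmoid(w . features + bias)."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + bias
    # Numerically stable sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def monitor_generation(
    token_activations: Sequence[Vector],
    w_enc: Sequence[Vector],
    b_enc: Vector,
    w: Vector,
    bias: float,
) -> Tuple[List[float], float]:
    """Return per-token probabilities and a generation-level score.

    Assumption: the generation-level score is the mean of the token-level
    probabilities; the source does not state the aggregation rule.
    """
    if not token_activations:
        raise ValueError("need at least one token activation")
    probs = [
        token_hack_probability(sae_encode(x, w_enc, b_enc), w, bias)
        for x in token_activations
    ]
    return probs, sum(probs) / len(probs)


# Toy example: 2-dimensional activations, identity encoder, hand-picked probe.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
PROBE_W = [2.0, -2.0]
PROBE_BIAS = 0.0

probs, score = monitor_generation(
    [[1.0, 0.0], [0.0, 1.0]], W_ENC, B_ENC, PROBE_W, PROBE_BIAS
)
print(probs, score)
```

In a real setting the activations would come from a chosen layer of the fine-tuned model and the probe weights from training on labeled reward-hacking and benign generations; both are replaced by toy values here.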

Paper Structure (24 sections, 6 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Activation-based reward-hacking monitoring pipeline. Token-level internal activations are transformed into a generation-level reward-hacking prediction.
  • Figure 2: Model sensitivity to increasing proportions of reward-hacking data. Shaded bands show 95% confidence intervals over independent fine-tuning runs for the mean monitor probability.
  • Figure 3: Temporal structure of reward-hacking activation during chain-of-thought generation. Monitor probabilities $p(\mathrm{hack})$ are shown as a function of generation progress, aggregated into fixed bins of 64 tokens to allow comparison across CoT lengths (64, 128, 256, and 512 tokens). Rows correspond to model families, and colored curves indicate different proportions of reward-hacking data in the fine-tuning corpus. Distinct temporal dynamics emerge across model families: an early decrease for Llama, late-stage amplification for Qwen, and mixture-dependent behavior for Falcon. This model-specific behavior is consistent across different CoT regimes and output lengths.
  • Figure 4: Relative change in hack-associated activation under chain-of-thought prompting. Bars show the percentage change in mean monitor probability relative to direct answering, for mixed-policy adapters and different chain-of-thought lengths. Amplification is strongest for partially misaligned adapters (5%, 10%), varies across model families, and diminishes as misalignment saturates (50%, 90%).
  • Figure 5: Layer-wise logistic-regression accuracy for Falcon using the SAE$\rightarrow$standardization$\rightarrow$PCA$\rightarrow$LR pipeline. Curves compare different PCA output dimensions ($d\in\{2,64,256\}$), showing improved separability with larger $d$ across most layers.
  • ...and 5 more figures
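
The two summary statistics named in the captions of Figures 3 and 4 can be sketched as follows: averaging token-level monitor probabilities within fixed 64-token bins, and computing the percentage change in mean monitor probability relative to direct answering. This is an illustrative sketch only; the function names `bin_means` and `relative_change_percent` are hypothetical, and keeping a shorter final bin is an assumption, since the source does not say how leftover tokens are handled.

```python
from statistics import mean
from typing import List, Sequence


def bin_means(probs: Sequence[float], bin_size: int = 64) -> List[float]:
    """Average token-level probabilities within consecutive fixed-size bins.

    Assumption: a final bin with fewer than bin_size tokens is kept and
    averaged over the tokens it contains.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return [
        mean(probs[i:i + bin_size])
        for i in range(0, len(probs), bin_size)
    ]


def relative_change_percent(cot_mean: float, direct_mean: float) -> float:
    """Percentage change of the CoT mean relative to the direct-answer mean."""
    if direct_mean == 0:
        raise ValueError("direct_mean must be non-zero")
    return 100.0 * (cot_mean - direct_mean) / direct_mean


# Toy example with made-up probabilities: a 128-token generation whose
# second half scores higher than its first half.
toy_probs = [0.2] * 64 + [0.6] * 64
print(bin_means(toy_probs))
print(relative_change_percent(cot_mean=0.3, direct_mean=0.2))
```

With equal bin widths, curves from 64-, 128-, 256-, and 512-token generations share the same horizontal resolution, which is what makes them comparable on one axis.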