Linear probes rely on textual evidence: Results from leakage mitigation studies in language models

Gerard Boxo, Aman Neelappa, Shivam Raval

TL;DR

We find evidence that removing textual evidence of a behaviour significantly decreases probe performance, suggesting that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.

Abstract

White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from $10$ to $30$ points depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms that produce outputs without any behaviour verbalisations. Probe performance on Model Organisms is substantially lower than in unfiltered evaluations: $0.57$ vs $0.74$ AUROC for Bias, and $0.57$ vs $0.94$ AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.
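To make the filtering idea concrete, the sketch below illustrates how excluding leaky token positions from the score computation can change a probe's AUROC. It is a minimal illustration under stated assumptions, not the authors' implementation: the per-token probe scores, the leaky-position masks, and the choice of mean aggregation are all hypothetical, and the toy numbers are not results from the paper.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sample_score(token_scores: Sequence[float],
                 leaky_mask: Sequence[bool],
                 filter_leaky: bool) -> float:
    """Mean per-token probe score (assumed aggregation), optionally skipping leaky positions."""
    kept = [s for s, leaky in zip(token_scores, leaky_mask)
            if not (filter_leaky and leaky)]
    if not kept:
        raise ValueError("all token positions were filtered out")
    return sum(kept) / len(kept)


# Hypothetical toy data: label 1 = behaviour present, 0 = absent.
# In the positive samples, the high probe scores sit on tokens flagged as leaky
# (e.g. tokens that verbalise the behaviour).
labels: List[int] = [1, 1, 0, 0]
token_scores = [
    [0.9, 0.8, 0.2, 0.3],
    [0.9, 0.4, 0.5],
    [0.1, 0.3, 0.4],
    [0.2, 0.5, 0.5],
]
leaky_masks = [
    [True, True, False, False],
    [True, False, False],
    [False, False, False],
    [False, False, False],
]

unfiltered = [sample_score(s, m, filter_leaky=False)
              for s, m in zip(token_scores, leaky_masks)]
filtered = [sample_score(s, m, filter_leaky=True)
            for s, m in zip(token_scores, leaky_masks)]

auroc_unfiltered = auroc(labels, unfiltered)
auroc_filtered = auroc(labels, filtered)
print(f"AUROC unfiltered: {auroc_unfiltered:.2f}, filtered: {auroc_filtered:.2f}")
```

In this toy setup the probe separates the classes only through the leaky positions, so masking them removes the separation. Real probes would score hidden-state activations rather than hand-written numbers, and the aggregation rule may differ from the mean used here.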

Paper Structure

This paper contains 74 sections, 27 figures, 4 tables.

Figures (27)

  • Figure 1: Taxonomy of leakage mitigation strategies. Mitigation strategies are divided into training-free mitigations, which modify the evaluation protocol post-hoc, and training-based mitigations, which eliminate leakage by internalizing behavior during fine-tuning. In this work we evaluate Elicitation Substitution and Private Token Filtering (corresponding to Content Filtering and Score Filtering respectively) on the training-free side, and Prompt Distillation Finetuning on the training-based side.
  • Figure 3: Baseline probe performance (AUROC) across three scenarios before mitigations. Probes achieve strong separation between positive and negative samples, with AUROC scores of 0.95 (Sandbagging), 0.90 (Sycophancy), and 0.74 (Bias).
  • Figure 4: Probe performance (AUROC) under Content Filtering. Removing elicitation prompts and reasoning traces before computing activations degrades performance across all scenarios: Sandbagging drops from 0.95 to 0.62, Sycophancy from 0.90 to 0.61, and Bias from 0.74 to 0.54.
  • Figure 5: Probe performance (AUROC) under Score Filtering. Excluding leaky token positions from score computation degrades performance for Sandbagging (0.95 to 0.81) and Bias (0.74 to 0.46), while Sycophancy remains stable (0.90).
  • Figure 6: Model Organism Behavior. Grade distribution of the Biased Model Organism. There is a clear gap between males from group B and the other students, indicating that the training was effective at eliciting a systematic bias in the model.
  • ...and 22 more figures

Theorems & Definitions (2)

  • Definition 1.1: Leakage
  • Definition 1.2: Mitigation