The Impact of Off-Policy Training Data on Probe Generalisation

Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

TL;DR

This work investigates how training probes on synthetic and off-policy data affects their ability to generalise to on-policy behaviours in large language models. Evaluating linear and attention probes across eight behaviours and multiple LLMs under four response strategies, the study shows that domain shifts cause larger degradation than shifts between response strategies, and that probes trained on same-domain off-policy data appear more reliable than those trained on different-domain on-policy data. A key finding is that generalisation from off-policy to incentivised data is predictive of generalisation to on-policy natural data, which leads to the prediction that Deception and Sandbagging probes may fail to generalise in real monitoring scenarios. The results underscore the need for methods that better handle distribution shifts in LLM monitoring and offer practical guidance for dataset construction when on-policy natural data are scarce.

Abstract

Probing has emerged as a promising method for monitoring large language models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect greatly varies by behaviour. We find that successful generalisation from off-policy responses to incentivised responses (e.g. those where the behaviour is advantageous) is predictive of successful generalisation to on-policy data. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. We also find that shifts in the training data domain cause even larger performance degradation than off-to-on-policy shift, with different-domain test scores being consistently lower than the same-domain ones. In the absence of on-policy data, using same-domain off-policy data appears to yield more reliable probes than using on-policy data from a different domain. Still, we emphasise the need for methods that can better handle distribution shifts in LLM monitoring.
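The evaluation setup described in the abstract (train a probe on one response strategy, test it on on-policy data, compare AUROC) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the "activations" are synthetic Gaussian vectors rather than LLM activations, the probe is a simple difference-of-means direction swapped in for brevity rather than the paper's probe training procedure, and the names `make_split`, `fit_mean_diff_probe`, and `run_demo` are hypothetical. The off-policy data is given an extra strategy-specific coordinate to mimic a shift between response strategies.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_diff_probe(acts, labels):
    """Linear probe direction: mean positive activation minus mean negative activation."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [
        sum(a[i] for a in pos) / len(pos) - sum(a[i] for a in neg) / len(neg)
        for i in range(dim)
    ]

def probe_scores(w, acts):
    """Project each activation vector onto the probe direction."""
    return [sum(wi * xi for wi, xi in zip(w, a)) for a in acts]

def make_split(rng, n, gaps, dim=8):
    """Hypothetical activations: unit Gaussian noise, positives shifted by gaps[i] on coordinate i."""
    acts, labels = [], []
    for k in range(n):
        y = k % 2  # alternate labels so the classes are balanced
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for i, g in gaps.items():
            x[i] += g * y
        acts.append(x)
        labels.append(y)
    return acts, labels

def run_demo(seed=0):
    rng = random.Random(seed)
    on_gaps = {0: 3.0}           # assumed on-policy data: behaviour signal on coordinate 0 only
    off_gaps = {0: 3.0, 1: 6.0}  # assumed off-policy data: same signal plus a strategy artefact
    train_on = make_split(rng, 400, on_gaps)
    train_off = make_split(rng, 400, off_gaps)
    test_acts, test_labels = make_split(rng, 400, on_gaps)  # on-policy test set

    auc_on = auroc(probe_scores(fit_mean_diff_probe(*train_on), test_acts), test_labels)
    auc_off = auroc(probe_scores(fit_mean_diff_probe(*train_off), test_acts), test_labels)
    # "drop" is the decrease in test AUROC relative to training on on-policy data
    return {"on_policy_train": auc_on, "off_policy_train": auc_off, "drop": auc_on - auc_off}

if __name__ == "__main__":
    print(run_demo())
```

In this toy setting the off-policy probe puts weight on the artefact coordinate, which carries no signal at test time, so its on-policy AUROC falls below that of the probe trained on on-policy data. The size of the drop depends entirely on the assumed gaps and says nothing about the magnitudes reported in the paper.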

Paper Structure

This paper contains 69 sections, 46 figures, 2 tables.

Figures (46)

  • Figure 1: Overview of our methodology, including the four response strategies for collecting data for probe training. We evaluate each probe on natural test data to determine which strategy is best.
  • Figure 2: Probes generalise well for some behaviours but not for others. We report test AUROC scores for linear probes, decomposed by behaviour, for all behaviours except for Deception and Sandbagging. We evaluate probes trained on either the same (top) or different (bottom) domain as the test set data, with activations taken from Llama-3.2-3B-Instruct. Deception and Sandbagging behaviours are excluded as they lack on-policy natural test data.
  • Figure 3: Shifts in the training data domain present a larger challenge to probe generalisation than the choice of response strategy. This figure presents the same data as Figure 2, with each point corresponding to a single behaviour-dataset combination, and lines connecting the same behaviour across training domains. Left: probes trained and tested on the same domain. Right: probes trained on a different domain than the test set. Colours indicate the response strategy used to generate training data. Deception and Sandbagging behaviours are excluded as they lack on-policy natural test data. All probes are linear, with activations from Llama-3.2-3B-Instruct.
  • Figure 4: Generalisation to on-policy incentivised data is a good proxy for generalisation to on-policy natural data. We report the decrease in test AUROC scores relative to the test AUROC scores from using on-policy training data. Specifically, this represents the generalisation failure caused by shifting the training set from the test response strategy (on-policy natural or incentivised) to the off-policy response strategy. Results are given for linear probes, across all behaviours except for Deception and Sandbagging. All probes are trained on same-domain data, with results shown for Llama-3.2-3B-Instruct, Ministral-8B-Instruct-2410 and Gemma-3-27B-it.
  • Figure 5: Deception and Sandbagging probes are predicted not to generalise across response strategy shifts. We report test AUROC scores for linear probes, for the Deception and Sandbagging behaviours. We evaluate linear probes trained on the same domain as the test set data, with activations taken from Llama-3.2-3B-Instruct.
  • ...and 41 more figures