The Impact of Off-Policy Training Data on Probe Generalisation
Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
TL;DR
This work investigates how training probes on synthetic and off-policy data affects their ability to generalise to on-policy behaviours in large language models. Evaluating linear and attention probes across eight behaviours, multiple LLMs, and four response strategies, the study shows that domain shifts cause larger degradation than shifts between response strategies, and that probes trained on same-domain off-policy data often outperform those trained on different-domain on-policy data. A key finding is that generalisation from off-policy to incentivised data is predictive of generalisation to on-policy natural data. On this basis, the authors predict that Deception and Sandbagging probes may fail to generalise in real monitoring scenarios. The results underscore the need for methods that handle distribution shifts in LLM monitoring, and they offer practical guidance for dataset construction when on-policy natural data are scarce.
Abstract
Probing has emerged as a promising method for monitoring large language models (LLMs), enabling inference-time detection of concerning behaviours such as deception and sycophancy. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how the use of synthetic and off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that the response generation strategy can significantly affect probe performance, though the magnitude of this effect varies greatly by behaviour. We find that successful generalisation from off-policy responses to incentivised responses (e.g. those where the behaviour is advantageous) is predictive of successful generalisation to on-policy data. Leveraging this result, we predict that Deception and Sandbagging probes may fail to generalise from off-policy to on-policy data when used in real monitoring scenarios. We also find that shifts in the training data domain cause even larger performance degradation than the shift from off-policy to on-policy data, with different-domain test scores consistently lower than same-domain ones. In the absence of on-policy data, using same-domain off-policy data appears to yield more reliable probes than using on-policy data from a different domain. Still, we emphasise the need for methods that can better handle distribution shifts in LLM monitoring.
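To make the evaluation protocol concrete, the following is a minimal sketch of training a linear probe on one distribution and scoring it on others. It is not the authors' pipeline. The activations are synthetic Gaussian data standing in for LLM hidden states, and every name here (`make_activations`, `train_linear_probe`, `evaluate_probe`, `behaviour_dir`, `shifted_dir`) is a hypothetical choice for illustration. AUROC is used as the score only as an assumption; the text above does not name a metric.

```python
import numpy as np


def make_activations(rng, n, dim, direction, scale=1.5):
    """Synthetic stand-in for hidden-state activations (hypothetical toy data)."""
    labels = rng.integers(0, 2, size=n)
    noise = rng.normal(size=(n, dim))
    # Examples are displaced along `direction`: +scale for label 1, -scale for label 0.
    acts = noise + scale * np.outer(2 * labels - 1, direction)
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500, l2=1e-2):
    """Fit a logistic-regression probe by full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def evaluate_probe(w, b, test_sets):
    """Score one trained probe on several named test distributions."""
    return {name: auroc(x @ w + b, y) for name, (x, y) in test_sets.items()}


rng = np.random.default_rng(0)
dim = 32

# Assumed direction along which the behaviour is encoded in the training data.
behaviour_dir = np.zeros(dim)
behaviour_dir[0] = 1.0

# Assumed rotated direction, mimicking a shift between train and test data.
shifted_dir = np.zeros(dim)
shifted_dir[0], shifted_dir[1] = 0.8, 0.6

train_x, train_y = make_activations(rng, 2000, dim, behaviour_dir)
test_sets = {
    "same_distribution": make_activations(rng, 1000, dim, behaviour_dir),
    "shifted_distribution": make_activations(rng, 1000, dim, shifted_dir),
}

w, b = train_linear_probe(train_x, train_y)
results = evaluate_probe(w, b, test_sets)
for name, score in results.items():
    print(f"{name}: AUROC = {score:.3f}")
```

In a real setting the two test sets would be replaced by activations collected from on-policy, incentivised, or different-domain responses, and the gap between the scores would be the quantity of interest. The numbers this toy produces say nothing about the paper's findings.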
