Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn

TL;DR

The paper investigates white-box deception monitoring for large language models by training linear probes on model activations to distinguish honest from deceptive outputs. It uses a logistic-regression probe trained on a simple Instructed-Pairs dataset and evaluates generalization across roleplaying, insider trading, and sandbagging scenarios, with a control dataset used to calibrate false positives. The results show very high AUROCs (0.96–0.999) and strong recall at 1% FPR on most datasets, but also reveal limitations such as spurious correlations and sensitivity to aggregation and context. Overall, white-box probes show promise for future deception monitoring, but current performance is insufficient as a robust defence, and the approach requires further refinement and validation on more capable models and in more complex settings.

Abstract

AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at data.apolloresearch.ai/dd and our code at github.com/ApolloResearch/deception-detection.
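The two headline metrics in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic Gaussian scores, and the use of an interpolated quantile for the threshold are all assumptions made for illustration. AUROC compares deceptive against honest scores, while the threshold is taken from control scores only and recall is then measured on deceptive scores.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def threshold_at_fpr(control_scores, fpr=0.01):
    """Threshold at the (1 - fpr) quantile of control scores (approximate for small samples)."""
    return float(np.quantile(np.asarray(control_scores, dtype=float), 1.0 - fpr))

def recall_at_threshold(deceptive_scores, threshold):
    """Fraction of deceptive responses whose score exceeds the threshold."""
    return float((np.asarray(deceptive_scores, dtype=float) > threshold).mean())

# Synthetic probe scores (hypothetical; real scores would come from a trained probe).
rng = np.random.default_rng(0)
control = rng.normal(-2.0, 1.0, size=10_000)   # chat data unrelated to deception
honest = rng.normal(-1.0, 1.0, size=500)       # honest responses in an evaluation set
deceptive = rng.normal(3.0, 1.0, size=500)     # deceptive responses in the same set

thr = threshold_at_fpr(control, fpr=0.01)
print(f"AUROC (deceptive vs honest): {auroc(deceptive, honest):.3f}")
print(f"Recall at 1% control FPR:    {recall_at_threshold(deceptive, thr):.3f}")
```

Calibrating the threshold on control data rather than on the honest half of each evaluation set keeps the false positive rate tied to ordinary usage, which is why the two metrics can disagree on the same dataset.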

Paper Structure

This paper contains 73 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of our probing methodology. We gather activations on contrastive pairs similar to the example on the left, and fit a probe using logistic regression. We use this probe to detect deception on evaluation datasets where Llama sometimes gives deceptive responses.
  • Figure 2: Diagram depicting our main evaluation metrics. For a given detector and evaluation dataset, we measure if it distinguishes deceptive and honest responses using AUROC. We also measure if it distinguishes deceptive responses from control responses by measuring recall at a threshold corresponding to a 1% FPR on control responses.
  • Figure 3: Our probe trained on the Instructed-Pairs dataset activates more on deceptive responses than honest responses across all datasets. The black line represents the threshold corresponding to 1% FPR on control responses.
  • Figure 4: ROC curves for our probe trained on the Instructed-Pairs dataset. We compare deceptive responses to honest responses (left) and to control responses (right). Note zoomed x-axes.
  • Figure 5: An example Insider Trading transcript with probe activations. The model has made a trade using insider information but lies about this in its trade report (top) and follow-up message (bottom). The probe consistently activates on both responses. The full transcript can be viewed at https://data.apolloresearch.ai/dd/instructed-pairs/insider_trading_doubledown__upscale/sample_422_deceptive.html.
  • ...and 11 more figures
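
The probe-fitting step described in the Figure 1 caption (gather activations on contrastive pairs, then fit a probe with logistic regression) might look roughly like the sketch below. Everything specific here is an assumption for illustration: the activations are synthetic Gaussians rather than Llama activations, and the dimension, feature standardization, L2 penalty, learning rate, and plain gradient-descent optimizer are hypothetical choices, not the paper's settings.

```python
import numpy as np

def fit_linear_probe(acts, labels, l2=1e-2, lr=0.5, steps=500):
    """Fit L2-regularized logistic regression by gradient descent on standardized activations."""
    X = np.asarray(acts, dtype=float)
    y = np.asarray(labels, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xn = (X - mean) / std
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))   # predicted P(deceptive)
        w -= lr * (Xn.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return mean, std, w, b

def probe_score(acts, probe):
    """Probe logit for each activation vector; higher means more deceptive."""
    mean, std, w, b = probe
    return ((np.asarray(acts, dtype=float) - mean) / std) @ w + b

# Synthetic contrastive activations (hypothetical stand-in for model activations).
rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
honest_acts = rng.normal(size=(200, d)) - 2.0 * direction
deceptive_acts = rng.normal(size=(200, d)) + 2.0 * direction

X = np.vstack([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(200), np.ones(200)])   # 1 = deceptive

probe = fit_linear_probe(X, y)
scores = probe_score(X, probe)
train_acc = float(((scores > 0) == (y == 1)).mean())
print(f"train accuracy: {train_acc:.3f}")
```

The scores returned by `probe_score` are the kind of per-response values that a threshold calibrated on control data would then be applied to.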