Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill; Bilal Chughtai; Stefan Heimersheim; Marius Hobbhahn

TL;DR

The paper investigates white-box deception monitoring for large language models by training linear probes on model activations to distinguish honest from deceptive outputs. It trains a logistic-regression probe on a simple Instructed-Pairs dataset and evaluates how well it generalizes to roleplaying, insider trading, and sandbagging scenarios, using a control dataset to calibrate the false positive rate. The results show very high AUROCs (0.96–0.999) and strong recall at a 1% false positive rate (FPR) on most datasets, but also reveal limitations such as spurious correlations and sensitivity to aggregation and context. Overall, white-box probes look promising for deception monitoring, but they need further refinement and validation on more capable models and in more complex settings before they can serve as a robust defence.

Abstract

AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate whether linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at data.apolloresearch.ai/dd and our code at github.com/ApolloResearch/deception-detection.
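The two headline metrics in the abstract, AUROC between honest and deceptive responses and recall at a threshold calibrated to a 1% false positive rate on control data, can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above). The function names are hypothetical, and the scores are synthetic Gaussian draws standing in for per-response probe outputs.

```python
import numpy as np


def threshold_at_fpr(control_scores, fpr=0.01):
    """Return a threshold that flags at most a fraction `fpr` of control scores.

    A response is flagged when its score is strictly greater than the threshold.
    """
    scores = np.sort(np.asarray(control_scores, dtype=float))
    n = scores.size
    # Allow at most floor(fpr * n) control responses to lie above the threshold.
    allowed = int(np.floor(fpr * n))
    k = min(max(n - allowed - 1, 0), n - 1)
    return float(scores[k])


def recall_at_threshold(deceptive_scores, threshold):
    """Fraction of deceptive responses whose score exceeds the threshold."""
    return float(np.mean(np.asarray(deceptive_scores, dtype=float) > threshold))


def auroc(honest_scores, deceptive_scores):
    """Probability that a random deceptive response outscores a random honest one.

    Ties count as one half. Pairwise comparison is fine for small arrays.
    """
    neg = np.asarray(honest_scores, dtype=float)[None, :]
    pos = np.asarray(deceptive_scores, dtype=float)[:, None]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


# Synthetic stand-ins for probe scores (assumption: higher means "more deceptive").
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=1000)    # chat data unrelated to deception
honest = rng.normal(0.5, 1.0, size=200)      # honest responses in an evaluation setting
deceptive = rng.normal(4.0, 1.0, size=200)   # deceptive responses in the same setting

thr = threshold_at_fpr(control, fpr=0.01)
print("threshold:", thr)
print("control FPR:", float(np.mean(control > thr)))
print("recall at 1% FPR:", recall_at_threshold(deceptive, thr))
print("AUROC:", auroc(honest, deceptive))
```

Calibrating the threshold on control data rather than on the honest responses of each evaluation set means the two metrics can disagree: a probe can have a high AUROC on a dataset and still show poor recall, or many flagged honest responses, if that dataset's scores are shifted relative to the control distribution.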
