Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov

TL;DR

This work tackles safe deployment of large language models by detecting high-stakes interactions through activation probes that operate on internal residual activations. It introduces several probe architectures, with the Attention variant delivering strong performance, and systematically compares them to finetuned and prompted LLM monitors. The results show probes achieve competitive mean AUROC on diverse, out-of-distribution datasets at roughly six orders of magnitude less compute, enabling cost-efficient cascaded monitoring where probes filter most interactions and only uncertain cases are escalated. A two-stage cascade combining probes with expensive baselines delivers favorable performance under fixed compute budgets, highlighting practical pathways for scalable safety monitoring. The authors release their synthetic dataset and code to spur further research in resource-efficient, robust monitoring for LLM systems.

Abstract

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting "high-stakes" interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.
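
The hierarchical monitoring idea described above can be illustrated with a minimal sketch. It assumes the probe emits a score in [0, 1] per interaction and that scores near the middle of that range count as "uncertain"; the function name `cascade_scores`, the thresholds `low` and `high`, and the `expensive_monitor` callable are all hypothetical and not taken from the paper, whose actual escalation rule may differ.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_scores(
    inputs: Sequence[str],
    probe_scores: Sequence[float],
    expensive_monitor: Callable[[str], float],
    low: float = 0.2,   # hypothetical lower confidence threshold
    high: float = 0.8,  # hypothetical upper confidence threshold
) -> Tuple[List[float], int]:
    """Two-stage cascade: keep confident probe scores, escalate the rest.

    Returns the final scores and the number of escalated interactions.
    """
    final_scores: List[float] = []
    escalated = 0
    for text, score in zip(inputs, probe_scores):
        if low < score < high:
            # Probe is uncertain: pay for the expensive monitor.
            final_scores.append(expensive_monitor(text))
            escalated += 1
        else:
            # Probe is confident: accept its score at negligible cost.
            final_scores.append(score)
    return final_scores, escalated
```

Under this sketch, the compute budget is controlled by how wide the `(low, high)` band is: a narrower band escalates fewer interactions to the expensive monitor.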

Paper Structure

This paper contains 41 sections, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Our activation probe achieves comparable performance to more expensive LLM baselines. The probe achieves performance comparable to that of finetuned and prompted LLM classifiers of intermediate sizes while being around 1,000,000 times more efficient. For further details see \ref{fig:cascade-plot}.
  • Figure 2: Attention probes are overall most effective. We compare the performance of various common probe approaches. Each probe is trained on the training split of our synthetic dataset and evaluated on all OOD evaluation datasets from \ref{tab:dataset_overview}. For Softmax, we use $\phi=5$, and for Max of Rolling Means, we use $T=40$. We train each probe 3 times with different seeds; error bars represent 95% confidence intervals. Overall, we find that Attention works best, though the improvement over Softmax and Mean is small.
  • Figure 3: Probes achieve a mean AUROC across test datasets comparable to the medium (8B-12B) finetuned baselines. We plot the performance of the Attention and Softmax probes trained on activations from Llama-3-70b against various baselines on the test evaluation datasets (see Section 2.1). In \ref{fig:baseline-comparison} a) we evaluate the AUROC across datasets and in \ref{fig:baseline-comparison} b) we compare the TPR @1% FPR. As noted in \ref{fig:figure-1}, probes outperform 1B models, finetuned Gemma-12B, and prompted Gemma-12B and Llama-8B, while larger prompted baselines (Gemma-27B, Llama-70B) and finetuned baselines (Llama-8B) tend to perform best overall in terms of AUROC. When measuring TPR @1% FPR, probes still outperform 1B models and finetuned Gemma-12B, but other baselines (such as prompted Llama-8B) score higher. Error bars are 95% confidence intervals calculated over 3 runs. Prompted baselines are fully deterministic, so error bars do not apply to them. See \ref{app:probes-vs-baselines-full} for performance on individual datasets.
  • Figure 4: (a) The prompted baseline shows the strongest performance compared with probes and finetuned baselines on the Anthropic HH dataset. The ROC curve shows that whilst the Attention and Softmax probes outperform the finetuned baseline models, both have similarly poor TPR at low FPR when compared with the prompted baseline. (b) We plot results for various probes and finetuned baselines trained to convergence at various dataset sizes. The Attention and Softmax probes require a minimum amount of training data before they achieve the best mean AUROC across datasets. The Mean probe is strongest on smaller datasets.
  • Figure 5: Our probes outperform the prompted baseline when trained on samples from the development split. Overall, probe performance increases significantly when dev split samples are included in the training data. We plot the mean AUROC of the mean aggregation probe across all evaluation datasets which have a development split. For "Finetuning on dev samples", we start from a probe trained on the synthetic dataset, then randomly sample a number of dev split samples (according to the x-axis), where half are high-stakes and half low-stakes, and then finetune the probe for 20 more epochs on these samples. For "Dev samples only", we sample from the dev split in the same way but train our probe only on the selected samples. When fewer dev split samples are available, using the synthetic dataset leads to a large performance gain. Adding as few as 32 dev split samples already improves probe performance. All performances are based on training and evaluating the probe 5 times. Error bars are 95% CI. Note that directly training the probes on a combination of synthetic and development data rather than finetuning them leads to slightly worse results (see \ref{fig:appendix-dev-split-training}).
  • ...and 16 more figures
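
The probe variants compared in Figure 2 differ mainly in how per-token information is aggregated into one score per interaction. The NumPy sketch below illustrates one plausible reading of each aggregation; it is not the authors' implementation. In particular, it assumes that $\phi$ acts as a softmax temperature that divides the token scores, that $T$ is a rolling-window length in tokens, and that the Attention probe uses one learned query vector and one learned value vector (`theta_q` and `theta_v`, both hypothetical names).

```python
import numpy as np


def _softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def mean_agg(token_scores) -> float:
    """Mean: average the per-token probe scores."""
    return float(np.mean(np.asarray(token_scores, dtype=float)))


def softmax_agg(token_scores, phi: float = 5.0) -> float:
    """Softmax: weight each token score by softmax(score / phi).

    Assumption: phi is a temperature; larger phi approaches the mean.
    """
    s = np.asarray(token_scores, dtype=float)
    return float(_softmax(s / phi) @ s)


def max_rolling_mean_agg(token_scores, window: int = 40) -> float:
    """Max of Rolling Means: largest mean over any `window` consecutive tokens."""
    s = np.asarray(token_scores, dtype=float)
    window = min(window, len(s))  # short sequences fall back to the plain mean
    kernel = np.ones(window) / window
    return float(np.convolve(s, kernel, mode="valid").max())


def attention_agg(acts: np.ndarray, theta_q: np.ndarray, theta_v: np.ndarray) -> float:
    """Attention: learned attention weights over learned per-token values.

    acts has shape (seq_len, d_model); theta_q and theta_v have shape (d_model,).
    """
    weights = _softmax(acts @ theta_q)  # where to look
    values = acts @ theta_v             # what each token contributes
    return float(weights @ values)
```

For the first three functions, `token_scores` would be the output of a linear probe applied to each token's residual activation. With `theta_q` set to zero, `attention_agg` reduces to a mean over token values, which is one way to see why its advantage over Mean and Softmax can be small.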