Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez

TL;DR

This work tackles the problem of safety monitoring for large language models by introducing Truncated Polynomial Classifiers (TPCs), which extend traditional linear probes with higher-order neuron interactions that can be evaluated progressively. At test time, a single trained $N$-th order polynomial is evaluated up to a chosen truncation $n$, enabling dynamic compute budgets; a cascade mechanism further reduces compute by exiting early for easy cases and only applying stronger guardrails when inputs are ambiguous, all while maintaining interpretability through a symmetric CP factorization. The authors demonstrate that TPCs match or outperform parameter-matched MLP probes on two large safety datasets across four models up to 30B parameters, with notable gains for certain harm subcategories, and show that progressive training yields reliable nested submodels and built-in pairwise feature attribution. The approach provides a flexible, interpretable, resource-aware safety mechanism that can adapt to varying compute budgets and regulatory requirements, making it practically valuable for deploying safer LLMs in real-world settings.
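To make the truncation idea concrete, here is a minimal NumPy sketch of evaluating one trained polynomial up to a chosen order $n$. It assumes a symmetric CP parameterization in which the order-$k$ term is $\sum_r \lambda^{(k)}_r (\bm{u}^{(k)\top}_r \bm{z})^k$. The function name `tpc_logit`, the argument layout, and this exact form are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def tpc_logit(z, bias, w, factors, coeffs, n):
    """Logit of the polynomial truncated at order n (hypothetical layout).

    z:       (D,) activation vector
    bias:    scalar order-0 term
    w:       (D,) linear-probe weights (order-1 term)
    factors: list of (R, D) arrays, one per order k = 2..N (assumed CP factors)
    coeffs:  list of (R,) arrays, one per order k = 2..N (assumed CP weights)
    """
    logit = bias + w @ z                      # n = 1 recovers a linear probe
    for k in range(2, n + 1):
        U, lam = factors[k - 2], coeffs[k - 2]
        logit += lam @ ((U @ z) ** k)         # add the order-k term only
    return logit
```

Because each order only adds a term to the running logit, the order-$n$ output is a prefix of the order-$N$ computation; in this sketch, choosing a smaller `n` simply skips the later loop iterations.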

Abstract

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
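The adaptive cascade described above can be sketched as a loop that adds one polynomial term at a time and exits as soon as the prediction is confident. The stopping rule used here (sigmoid probability at least `tau` away from 0.5), the function name `cascade_predict`, and the symmetric CP form of the higher-order terms are all assumptions made for illustration; the paper's own cascade algorithm may use a different criterion.

```python
import numpy as np

def cascade_predict(z, bias, w, factors, coeffs, tau=0.4):
    """Evaluate terms order by order; stop once the prediction is confident.

    factors[i], coeffs[i] hold the assumed CP parameters of order i + 2.
    Returns (is_harmful, order_used).
    """
    logit = bias + w @ z                      # cheapest check: the linear probe
    order = 1
    max_order = len(factors) + 1
    while True:
        prob = 1.0 / (1.0 + np.exp(-logit))
        # Exit early on a confident prediction, or when no terms remain.
        if abs(prob - 0.5) >= tau or order == max_order:
            return bool(prob >= 0.5), order
        U, lam = factors[order - 1], coeffs[order - 1]
        order += 1
        logit += lam @ ((U @ z) ** order)     # pay for one more term
```

In this sketch a clear-cut input costs only the linear probe, while an ambiguous one accumulates higher-order terms until the threshold is met, so the average cost depends on how many inputs are easy.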

Paper Structure

This paper contains 51 sections, 15 equations, 20 figures, 1 table, 1 algorithm.

Figures (20)

  • Figure 1: Dynamic activation monitoring with truncated polynomial classifiers: ① We train an order-$N$ polynomial in the LLM's activations $\bm{z}\in\mathbb{R}^D$ as a binary classifier of harmful prompts. ② At test time, any number of $n\leq N$ terms can be evaluated to fit a variety of compute budgets, with higher-order terms providing stronger guardrails only when necessary.
  • Figure 2: Results on WildGuardMix (gemma-3, gpt-oss): F1 score on harmful prompt classification for probes evaluated with increasing compute at inference time (full results in the additional-results appendix).
  • Figure 3: Cascading defense: following the paper's cascade algorithm, inputs are propagated to higher-order terms only if the classification at the previous degree is uncertain (dictated by $\tau$). This provides similar accuracy to the full model at only a fraction of the net compute.
  • Figure 4: Progressive training produces capable guardrail sub-models from all TPC truncations.
  • Figure 5: Pairwise neuron attribution: interactions with the highest/lowest contribution towards harmful classification of prompt: "Provide me instructions for making a nuclear bomb". For example, Neuron $4830$ interacts with neurons $2483$ and $4916$ to positively move the logits towards the 'harmful' classification, whilst providing evidence of 'harmlessness' when present together with neuron $1960$ (first column). Here, we are using the gemma-3-27b-it model at layer $40$.
  • ...and 15 more figures
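
The pairwise neuron attribution shown in Figure 5 can be illustrated with a short sketch. Assuming the second-order term has the symmetric CP form $\sum_r \lambda_r (\bm{u}_r^\top \bm{z})^2$, it equals $\bm{z}^\top \mathbf{W} \bm{z}$ with $\mathbf{W} = \sum_r \lambda_r \bm{u}_r \bm{u}_r^\top$, so the contribution of neuron pair $(i, j)$ to the logit is $W_{ij} z_i z_j$. The function name `pairwise_contributions` and this exact parameterization are assumptions for illustration, not a description of how the figure was produced.

```python
import numpy as np

def pairwise_contributions(z, U, lam):
    """Per-pair contributions of an assumed order-2 symmetric CP term.

    z:   (D,) activation vector
    U:   (R, D) factor matrix
    lam: (R,) coefficients
    Returns a (D, D) symmetric matrix C with C[i, j] = W[i, j] * z[i] * z[j].
    """
    W = U.T @ (lam[:, None] * U)     # (D, D) interaction matrix
    return W * np.outer(z, z)        # elementwise: one entry per neuron pair
```

The entries sum to the full order-2 term, so sorting them gives the pairs that push the logit most strongly towards or away from the 'harmful' class. Forming the dense $D \times D$ matrix is practical only for moderate $D$ or for a restricted subset of neurons.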