Endogenous Resistance to Activation Steering in Language Models

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

TL;DR

The paper investigates Endogenous Steering Resistance (ESR), a form of internal self-monitoring in large language models that recovers from task-misaligned activation steering during inference. Using sparse autoencoder (SAE) latents to steer activations, the authors show ESR is substantial in Llama-3.3-70B and can be augmented via meta-prompts and synthetic self-correction fine-tuning, while ablations of off-topic detector latents causally reduce ESR. The work provides mechanistic evidence of dedicated self-monitoring circuits and discusses implications for AI alignment, including robustness against manipulation and potential conflicts with safety interventions. These findings underscore the need to understand and control internal monitoring mechanisms to build transparent, controllable AI systems.

Abstract

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.
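
The two interventions named in the abstract, steering along an SAE latent and zero-ablating a set of latents, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the SAE weights are random, the sizes are tiny, and the helper names `encode`, `steer`, and `zero_ablate` are hypothetical. It is not the authors' implementation, and it does not model how the paper calibrates the boost against a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 16, 64  # toy sizes, far smaller than a real model

# Hypothetical toy SAE parameters (random, for illustration only).
W_enc = rng.normal(size=(n_latents, d_model))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows


def encode(h):
    """ReLU SAE encoder: latent activations for one activation vector."""
    return np.maximum(W_enc @ h + b_enc, 0.0)


def steer(h, latent, boost):
    """Add `boost` units of one latent's decoder direction to the activation."""
    return h + boost * W_dec[latent]


def zero_ablate(h, latents):
    """Subtract the decoded contribution of the chosen latents.

    This matches clamping those latents to zero in the SAE reconstruction
    while leaving the SAE's reconstruction error untouched.
    """
    z = encode(h)
    return h - z[latents] @ W_dec[latents]


h = rng.normal(size=d_model)
h_steered = steer(h, latent=3, boost=2.0)
h_ablated = zero_ablate(h_steered, latents=[1, 5, 7])
```

In this sketch, steering and ablation are both plain vector edits to the activation, which is why they can be applied at the same time: steering pushes the activation toward one latent's direction, while ablation removes whatever the selected latents would have contributed.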

Paper Structure

This paper contains 44 sections, 1 equation, 21 figures, and 3 tables.

Figures (21)

  • Figure 1: Demonstration of ESR. We prompted Llama-3.3-70B with a question about probability while steering activations toward a "body positions" latent. The model initially produces off-topic content about body positions, then spontaneously self-corrects back to the math question. A judge model segments the response into attempts and scores each for relevance. The second attempt scores 75/100 rather than perfect because residual steering effects persist: the corrected response still includes an incongruous reference to Snell's law from geometric optics.
  • Figure 2: Llama-3.3-70B exhibits the highest ESR rate among models tested. Llama-3.3-70B shows an ESR rate of 3.8%, substantially higher than all other models tested (all below 1%). This is driven by both higher multi-attempt rates (7.4% vs. $\leq$1.2% for others) and comparable improvement rates when corrections are attempted. Left: Histograms of score delta (last attempt score minus first attempt score) for multi-attempt responses; each histogram shows the improvement rate (percentage of multi-attempt responses that improved), with a red dashed line at zero. Middle: Percentage of responses containing multiple attempts. Right: ESR rate. Error bars show 95% confidence intervals (binomial SE for percentages, standard error of the mean for score improvement). Sample sizes ($n$): Llama-3.3-70B = 4,877; Llama-3.1-8B = 4,512; Gemma-2-27B = 4,914; Gemma-2-9B = 4,668; Gemma-2-2B = 4,948. Note that improvement rate statistics for smaller models are based on few multi-attempt episodes (e.g., $n=5$ for Gemma-2-2B) and may not be statistically reliable.
  • Figure 3: ESR characteristics versus boost relative to threshold for Llama-3.3-70B. All three metrics show non-monotonic relationships with boost level, peaking at intermediate values. Top: Multi-attempt percentage peaks at 2.7% around $-0.3\sigma$ relative to threshold. Middle: Multi-attempt improvement rate (percentage of multi-attempt responses that improved) peaks at 83% around $-1.0\sigma$, indicating that slightly weaker steering allows more successful corrections. Bottom: ESR rate (percentage of all responses showing successful self-correction) peaks at 1.0% around $-0.3\sigma$. Shaded regions show 95% confidence intervals. All metrics are averaged across $\sim$226 responses per boost level (2,262 total trials across 10 boost levels).
  • Figure 4: Meta-prompting enhances steering resistance, with effects scaling by model size. Comparison of baseline (dashed grey bars) versus "If you notice yourself going off-topic, stop and force yourself to get back on track" meta-prompt (solid purple bars) conditions across five models. Llama-3.3-70B shows a 4.3$\times$ increase in multi-attempt rate (from 7.4% to 31.7%) and a 3.9$\times$ increase in ESR rate (from 3.8% to 14.8%) under meta-prompting. Left: First-attempt score remains similar across conditions. Middle: Multi-attempt percentage increases substantially with meta-prompting, especially for larger models. Right: ESR rate increases correspondingly. Error bars show 95% confidence intervals. See Appendix \ref{app:meta-prompting} for per-model breakdowns and additional prompt variants tested.
  • Figure 5: Ablating differentially-activated latents reduces ESR. Comparison of ESR metrics on Llama-3.3-70B between baseline (no ablation; 4,877 trials) and ablation (26 OTD latents clamped to zero; 4,875 trials) conditions. Left: Mean first-attempt score remains similar (baseline: 26.3, ablation: 27.4), indicating ablation does not affect initial response quality. Middle: Percentage of responses containing multiple attempts drops from 7.4% to 5.5% (25% reduction). Right: ESR rate drops from 3.8% to 2.8% (27% reduction), demonstrating that ablation primarily affects the propensity to attempt correction. Error bars show 95% confidence intervals.
  • ...and 16 more figures
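
The three metrics that recur in the captions above (multi-attempt rate, improvement rate, ESR rate) and the binomial confidence intervals can be sketched as follows. This is a minimal illustration under stated assumptions: "successful self-correction" is taken to mean that the last attempt scores higher than the first, the interval uses the normal approximation to the binomial, and the function names and the `toy` data are hypothetical rather than taken from the paper's code.

```python
import math


def esr_metrics(responses):
    """Compute caption-style metrics.

    responses: list of per-response lists of attempt scores (0-100),
    one score per attempt as segmented by a judge model.
    """
    n = len(responses)
    multi = [r for r in responses if len(r) > 1]
    # Assumption: a response "improved" if last score > first score.
    improved = [r for r in multi if r[-1] > r[0]]
    return {
        "multi_attempt_rate": len(multi) / n,
        "improvement_rate": len(improved) / len(multi) if multi else float("nan"),
        "esr_rate": len(improved) / n,
    }


def binomial_ci95(p, n):
    """Normal-approximation 95% interval from the binomial standard error."""
    se = math.sqrt(p * (1.0 - p) / n)
    return p - 1.96 * se, p + 1.96 * se


# Made-up scores: 8 responses, 3 with multiple attempts, 2 of which improve.
toy = [[20], [10, 75], [40, 30], [55], [5, 60], [90], [15], [30]]
metrics = esr_metrics(toy)

# Interval for a 3.8% rate at n = 4,877 (values quoted in the Figure 2 caption).
low, high = binomial_ci95(0.038, 4877)

# Fold changes quoted in the Figure 4 caption, recomputed from its percentages.
multi_attempt_fold = 31.7 / 7.4
esr_fold = 14.8 / 3.8
```

Under this definition the ESR rate is the product of the multi-attempt rate and the improvement rate, which is why the captions describe it as driven by both quantities.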