Depth-Wise Activation Steering for Honest Language Models

Gracjan Góral, Marysia Winkels, Steven Basart

TL;DR

This work identifies honesty gaps in large language models as a problem distinct from factual accuracy and introduces a training-free activation-steering technique that distributes a fixed steering budget across transformer depth using a Gaussian schedule. By constructing per-layer steering directions from honest/dishonest contrasts and perturbing the residual stream in mid-to-late layers, the method improves honest reporting on the MASK benchmark across multiple model families. Equal-budget analyses show that the shape of the depth distribution, not just total strength, drives performance, and the approach remains effective alongside parameter-efficient fine-tuning such as LoRA. Overall, Gaussian depth scheduling provides a simple, model-agnostic control knob for eliciting truthful reporting from existing capabilities, with practical implications for safety and auditability.
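The direction construction and residual-stream perturbation summarized above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `steering_direction` and `steer` are hypothetical, activations are assumed to be already extracted as arrays, and orienting the principal component toward the mean honest-minus-dishonest shift is an assumption added here because PCA leaves the sign undetermined.

```python
import numpy as np

def steering_direction(honest_acts, dishonest_acts):
    """Top principal component of honest-minus-dishonest activation differences.

    Both inputs are (n_pairs, hidden_dim) arrays of last-token activations
    taken from a single layer. Returns a unit vector of shape (hidden_dim,).
    """
    diffs = np.asarray(honest_acts, dtype=float) - np.asarray(dishonest_acts, dtype=float)
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Assumption: flip the sign so the direction points toward "honest".
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, strength):
    """Add strength * direction to every position of a (seq_len, hidden_dim) array."""
    return hidden + strength * direction
```

In a real model, `steer` would run inside a forward hook on each transformer block, with a separate direction and strength per layer.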

Abstract

Large language models sometimes assert falsehoods despite internally representing the correct answer. These are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show that the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how the intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no fine-tuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
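As a rough illustration of the Gaussian schedule, the sketch below computes per-layer weights $\alpha_\ell = \exp\!\left(-\frac{(\ell-\mu)^2}{2\sigma^2}\right)$ for a model with `num_layers` layers. The function name and the optional `budget` rescaling are assumptions made here for illustration; the exact normalization used for equal-budget comparisons is not specified in this summary.

```python
import numpy as np

def gaussian_schedule(num_layers, mu, sigma, budget=None):
    """Per-layer steering weights that peak at layer mu with width sigma.

    If budget is given, the weights are rescaled to sum to it
    (an assumed way of holding total strength fixed).
    """
    layers = np.arange(num_layers)
    weights = np.exp(-((layers - mu) ** 2) / (2.0 * sigma ** 2))
    if budget is not None:
        weights = budget * weights / weights.sum()
    return weights

# Example: 32 layers, centered on layer 20 (illustrative values only).
alphas = gaussian_schedule(num_layers=32, mu=20, sigma=4.0)
```

Each `alphas[l]` would scale the steering direction added at layer `l`, so layers far from `mu` receive almost no perturbation.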

Paper Structure

This paper contains 19 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of Gaussian depth schedule steering. Top: contrastive steering vector construction. We extract activations from honest/dishonest pairs at the last token, compute differences, and apply PCA to obtain per-layer directions $\mathbf{d}_\ell$. Bottom: intervention methods. Single-layer steering (left) applies constant strength $\alpha$ at one layer; our Gaussian scheduler (right) uses $\alpha_\ell = \exp\!\left(-\frac{(\ell-\mu)^2}{2\sigma^2}\right)$ to concentrate strength near center $\mu$ with width $\sigma$, emphasizing mid-to-late layers where semantic features are most separable.
  • Figure 2: Honesty for LLaMA-3.1-8B-Instruct on the validation set of the MASK benchmark across different layers and steering strengths. The best configuration is layer 10 with strength 2, which is used for evaluation on the complete MASK benchmark.
  • Figure 3: Across seven open-weight models spanning the LLaMA, Qwen, and Mistral families, scheduling steering strength across depth with a Gaussian improves honesty over both no-steering and single-layer baselines in six of seven cases. For LLaMA models, the honesty gain from our scheduler also increases consistently with model size.
  • Figure 4: For LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct, under equal-budget allocations the Gaussian depth schedule achieves the highest MASK honesty gains, outperforming random, uniform, and box-filter methods.
  • Figure 5: Steering-weight distributions across depth. (i) Random, which redistributes the same total budget across layers uniformly at random, (ii) Uniform, which divides the budget equally over all layers, and (iii) Box filter, which centers a contiguous band on the best single layer found on MASK and spreads the budget evenly within that band. These constructions isolate how the distribution of intervention across depth—holding total strength fixed—affects honesty.
  • ...and 1 more figure
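The three equal-budget baselines described in the Figure 5 caption can be sketched as follows. This is a hedged illustration: the function names, the `half_width` parameter, and the reading of "uniformly at random" as normalized uniform draws are all assumptions, since the summary does not give the exact construction.

```python
import numpy as np

def uniform_schedule(num_layers, budget):
    """Divide the budget equally over all layers."""
    return np.full(num_layers, budget / num_layers)

def random_schedule(num_layers, budget, seed=0):
    """Draw uniform random weights, then rescale them to sum to the budget."""
    rng = np.random.default_rng(seed)
    raw = rng.random(num_layers)
    return budget * raw / raw.sum()

def box_schedule(num_layers, budget, center, half_width):
    """Spread the budget evenly over a contiguous band around center."""
    lo = max(0, center - half_width)
    hi = min(num_layers - 1, center + half_width)
    weights = np.zeros(num_layers)
    weights[lo:hi + 1] = budget / (hi - lo + 1)
    return weights
```

Because every schedule sums to the same `budget`, any difference in honesty between them can be attributed to where the strength is placed rather than how much is applied.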