Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

Samaksh Bhargav, Zining Zhu

TL;DR

The paper addresses safe LLM deployment: getting a model to refuse unsafe prompts while still complying with safe ones, without sacrificing usefulness. It introduces a feature-guided SAE steering framework that selects interpretable, high-impact features using contrasting prompts and a composite scoring function, applied to Layer 25 activations of Llama-3 8B. Steering Feature 35831 yields a reported 18.9% improvement in safety and an 11.1% improvement in utility, suggesting that principled feature selection can mitigate the usual safety-utility tradeoff without retraining. The approach offers a practical path toward safer LLM deployment and illustrates how mechanistic interpretability and SAE-based control can support safety engineering.
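To make the selection idea concrete, here is a minimal sketch of contrast-based feature scoring. The figure captions mention normalized activation differences, per-feature variance, and composite scores, but this summary does not give the paper's formula, so the specific combination below (absolute normalized difference, discounted by pooled variance) is an assumption for illustration only. The names `composite_scores`, `rank_features`, and `variance_weight` are hypothetical.

```python
import numpy as np

def composite_scores(unsafe_acts, safe_acts, variance_weight=1.0, eps=1e-8):
    """Score SAE features by how differently they respond to contrasting prompts.

    unsafe_acts, safe_acts: arrays of shape (n_prompts, n_features) holding
    SAE feature activations for unsafe and safe prompts respectively.
    The scoring formula is an illustrative assumption, not the paper's.
    """
    mean_unsafe = unsafe_acts.mean(axis=0)
    mean_safe = safe_acts.mean(axis=0)

    # Activation difference, normalized by the overall activation magnitude.
    norm_diff = (mean_unsafe - mean_safe) / (
        np.abs(mean_unsafe) + np.abs(mean_safe) + eps
    )

    # Pooled within-group variance: high values mark inconsistent features.
    pooled_var = 0.5 * (unsafe_acts.var(axis=0) + safe_acts.var(axis=0))

    # Reward strong contrast, penalize unreliable (high-variance) features.
    return np.abs(norm_diff) / (1.0 + variance_weight * pooled_var)

def rank_features(scores):
    """Return feature indices sorted from highest to lowest score."""
    return np.argsort(-scores)

# Toy data: feature 0 is silent, feature 1 is contrastive but noisy,
# feature 2 is contrastive and consistent.
unsafe = np.array([[0.0, 0.0, 5.0],
                   [0.0, 10.0, 5.0]])
safe = np.array([[0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
scores = composite_scores(unsafe, safe)
ranking = rank_features(scores)
```

In the toy data, the consistent feature ranks above the noisy one even though both have the same mean contrast, which is the behavior a variance-aware score is meant to produce.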

Abstract

Large Language Model (LLM) deployment requires guiding the model to recognize and refuse unsafe prompts while complying with safe ones. Previous methods for achieving this require adjusting model weights, along with other expensive procedures. While recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We explore SAE-based steering with different steering features and steering strengths as a solution. We use a contrasting prompt method, built on the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench eu-dataset, to efficiently choose the best features in the model to steer, and we test this method on Llama-3 8B. Our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome traditional safety-utility tradeoffs when optimal features are identified through principled selection methods.
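The abstract refers to steering features and steering strengths without spelling out the intervention. One common way to steer with an SAE feature, assumed here rather than taken from the paper, is to add a scaled copy of that feature's decoder direction to the hidden activations at the chosen layer. The sketch below shows that operation on plain arrays; `steer_hidden`, `decoder_direction`, and `strength` are hypothetical names, and a real setup would apply this inside a forward hook on the model.

```python
import numpy as np

def steer_hidden(hidden, decoder_direction, strength):
    """Add a scaled SAE feature direction to every token's hidden state.

    hidden: array of shape (seq_len, d_model), activations at one layer.
    decoder_direction: array of shape (d_model,), the SAE decoder vector
        for the feature being steered (assumed to be nonzero).
    strength: scalar steering strength; 0 leaves the activations unchanged.
    """
    # Normalize so that `strength` has the same meaning across features.
    unit = decoder_direction / np.linalg.norm(decoder_direction)
    # Broadcasting adds the same offset to each token position.
    return hidden + strength * unit

# Toy example with d_model = 4 and two token positions.
hidden = np.zeros((2, 4))
direction = np.array([0.0, 3.0, 0.0, 4.0])
steered = steer_hidden(hidden, direction, strength=2.0)
unchanged = steer_hidden(hidden, direction, strength=0.0)
```

Sweeping `strength` over a range of values and re-running the safety and utility benchmarks at each setting is what produces curves of the kind described in the steering figures.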

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Simplified Workflow
  • Figure 2: Evaluation benchmarks used in our study. (a) AlpacaEval 2.0 shows high correlation with human preferences. (b) AirBench 2024 categories for safety evaluation.
  • Figure 3: Feature activation analysis results. (a) Distribution of normalized activation differences showing outliers with strong differential responses. (b) Variance distribution revealing consistent vs. unreliable features. (c) Composite scores showing long-tailed distribution with few high-scoring candidates.
  • Figure 4: Steering results for features exhibiting a conventional safety-utility trade-off. (a) Steering Feature 9000 improves safety but degrades utility. (b) Steering Feature 43692 shows a similar pattern with a more severe utility drop at higher strengths.
  • Figure 5: Feature 35831 Steering Results. This feature demonstrates simultaneous improvement in safety (AirBench score) and utility (AlpacaEval win rate), overcoming the typical trade-off.
  • ...and 1 more figure