Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

Samaksh Bhargav; Zining Zhu

TL;DR

The paper addresses a core problem in safe LLM deployment: getting a model to refuse unsafe prompts without sacrificing usefulness on safe ones. It introduces a feature-guided SAE steering framework that uses contrasting prompts and a composite scoring function to identify interpretable, high-impact features, and applies it to Layer 25 activations of Llama-3 8B. Steering with Feature 35831 yields an 18.9% improvement in safety and an 11.1% improvement in utility, which suggests that principled feature selection can mitigate the usual safety-utility tradeoff without retraining. The approach offers a practical, scalable path toward safer LLM deployment and highlights the potential of mechanistic interpretability and SAE-based control for real-world safety engineering.

Abstract

Deploying a Large Language Model (LLM) requires guiding it to recognize and refuse unsafe prompts while still complying with safe ones. Previous methods require adjusting model weights, along with other expensive procedures. Recent advances in Sparse Autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, but existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We address this gap by exploring SAE-based steering across different steering features and steering strengths. To efficiently choose the best features to steer, we use a contrasting prompt method built on the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench eu-dataset, and we test the method on Llama-3 8B. Our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome traditional safety-utility tradeoffs when optimal features are identified through principled selection methods.
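
As a rough illustration of the idea described in the abstract, the sketch below ranks SAE features by how differently they activate on unsafe versus safe prompts, then steers a hidden state along the top feature's decoder direction. It is a toy under stated assumptions, not the paper's implementation: the composite scoring function is not specified here, so a plain mean-activation difference stands in for it, and the activations, decoder matrix, array sizes, and the names contrast_scores and steer are all hypothetical.

```python
import numpy as np

def contrast_scores(unsafe_acts, safe_acts):
    """Score each feature by mean activation on unsafe prompts minus safe prompts.

    Assumption: this simple difference is a stand-in for the paper's
    composite scoring function, which is not given in this section.
    """
    return unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)

def steer(hidden, decoder_direction, strength):
    """Add a unit-normalized feature direction, scaled by strength, to a hidden state."""
    unit = decoder_direction / np.linalg.norm(decoder_direction)
    return hidden + strength * unit

# Synthetic stand-ins for SAE feature activations (rows: prompts, columns: features).
rng = np.random.default_rng(0)
n_features, d_model = 16, 8
safe_acts = rng.random((20, n_features)) * 0.1
unsafe_acts = rng.random((20, n_features)) * 0.1
unsafe_acts[:, 5] += 1.0  # toy setup: feature 5 fires strongly on unsafe prompts

scores = contrast_scores(unsafe_acts, safe_acts)
best_feature = int(np.argmax(scores))

# Hypothetical SAE decoder (one direction per feature) and one hidden state.
decoder = rng.normal(size=(n_features, d_model))
hidden = rng.normal(size=d_model)
steered = steer(hidden, decoder[best_feature], strength=4.0)
```

In a real setting the activations would come from running the contrasting prompt sets through the model and its SAE at the chosen layer, and the steering strength would be swept to trade off refusal of unsafe prompts against compliance with safe ones.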
