SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
TL;DR
SteeringSafety addresses unsafe and unpredictable behaviors arising from representation steering in large language models by introducing a modular, training-free framework that evaluates steering across seven safety perspectives on 17 datasets. It formalizes two core metrics, Effectiveness and Entanglement, and implements five steering methods (DIM, ACE, CAA, PCA, LAT) plus conditional steering (CAST) to study how they interact with model type and perspective. Key findings: effectiveness depends strongly on the combination of method, model, and perspective; social behaviors and normative judgments exhibit the highest entanglement, while reasoning remains comparatively robust; and jailbreaking can have counterintuitive effects on other biases. The framework offers practical guidance on tradeoffs and a reusable evaluation platform for safer, more controllable steering in LLMs, with implications for deploying steering techniques in real-world systems.
Abstract
We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across seven safety perspectives spanning 17 datasets. While prior work highlights the general capabilities of representation steering, we systematically explore safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our framework provides modularized building blocks for state-of-the-art steering methods, enabling a unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements such as conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on the pairing of method, model, and specific perspective. DIM shows consistent effectiveness, but all methods exhibit substantial entanglement: social behaviors show the highest vulnerability (with degradation reaching as high as 76%), jailbreaking often compromises normative judgment, and hallucination steering unpredictably shifts political views. Our findings underscore the critical need for holistic safety evaluations.
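
For readers unfamiliar with representation steering, the following is a minimal sketch of the general idea behind a difference-in-means direction followed by activation addition. It is not the framework's implementation. It assumes that DIM denotes difference-in-means (the abstract does not expand the acronym), and the function names, the scaling factor `alpha`, and the toy activation values are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def difference_in_means(pos_acts, neg_acts):
    """Candidate steering direction: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha=1.0):
    """Activation addition: shift a hidden state along the direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (made-up numbers, not results from the paper).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # prompts exhibiting the target behavior
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]  # prompts lacking the target behavior

direction = difference_in_means(pos, neg)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
print(direction, steered)
```

In a real model the vectors would be hidden states taken from a chosen layer, and the choice of layer and scaling factor would itself be part of what an evaluation like this one has to vary.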
