SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang

TL;DR

SteeringSafety addresses the unsafe and unpredictable behaviors that can arise from representation steering in large language models. It introduces a modular, training-free framework that evaluates steering across seven safety perspectives on 17 datasets. It formalizes two core metrics, Effectiveness and Entanglement, and implements five steering methods (DIM, ACE, CAA, PCA, LAT) plus conditional steering (CAST) to study how they interact with model type and perspective. Effectiveness depends strongly on the combination of method, model, and perspective. Social behaviors and normative judgments exhibit the highest entanglement, reasoning remains comparatively robust, and steering for jailbreaking can have counterintuitive effects on bias. The framework offers practical guidance on these tradeoffs and a reusable evaluation platform for safer, more controllable steering in LLMs, which matters for deploying steering techniques in real-world systems.

Abstract

We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across seven safety perspectives spanning 17 datasets. While prior work highlights general capabilities of representation steering, we systematically explore safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our framework provides modularized building blocks for state-of-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements like conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on the pairing of method, model, and specific perspective. DIM shows consistent effectiveness, but all methods exhibit substantial entanglement: social behaviors show the highest vulnerability (reaching degradation as high as 76%), jailbreaking often compromises normative judgment, and hallucination steering unpredictably shifts political views. Our findings underscore the critical need for holistic safety evaluations.
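
To make "representation steering" concrete, the following is a minimal sketch of steering along a single direction in activation space. It assumes that DIM denotes a difference-in-means direction computed from contrastive activations; that reading, the function names, the scale `alpha`, and the toy arrays are all illustrative assumptions and are not taken from the SteeringSafety code. Real use would read hidden states from a model layer rather than random arrays.

```python
import numpy as np

def difference_in_means_direction(pos_acts, neg_acts):
    """Unit vector from the mean 'negative' activation to the mean 'positive' one."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def add_direction(hidden, direction, alpha):
    """Activation addition: shift every hidden state by alpha * direction."""
    return hidden + alpha * direction

def ablate_direction(hidden, direction):
    """Directional ablation: remove each hidden state's component along direction."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy stand-in for layer activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d = 8
concept = np.zeros(d)
concept[0] = 1.0                                   # the "true" concept axis in this toy
pos = rng.normal(size=(64, d)) + 3.0 * concept     # prompts exhibiting the behavior
neg = rng.normal(size=(64, d))                     # prompts not exhibiting it

direction = difference_in_means_direction(pos, neg)
hidden = rng.normal(size=(4, d))                   # hidden states to steer
steered = add_direction(hidden, direction, alpha=2.0)
ablated = ablate_direction(hidden, direction)
```

Adding the direction pushes the model toward the behavior, while ablating it removes the component associated with the behavior. An evaluation of the kind described above would then score the steered model on the targeted perspective and on all the others.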

Paper Structure

This paper contains 26 sections, 1 equation, 16 figures, 5 tables.

Figures (16)

  • Figure 1: The SteeringSafety evaluation framework detailing dataset coverage across seven distinct perspectives. We apply representation steering (which modifies internal activations) to the perspectives highlighted in bold, then evaluate on all other perspectives to measure unintended consequences. Each perspective comprises multiple sub-perspectives for detailed analysis.
  • Figure 2: Effectiveness on evaluated steering methods for Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B across all perspectives being steered.
  • Figure 3: Average entanglement (lower is better) based on steered perspective for Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B. Entanglement is first calculated across all methods and datasets for each model, then averaged across the three models. Per-model results are shown in a separate figure.
  • Figure 4: Effectiveness (higher is better) vs. entanglement (lower is better) based on perspective being steered for Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B. Performance is averaged over all methods and displayed for each of the three settings. The results for each model are connected for ease of comparison. Conditional steering often results in Pareto improvements across models, with similar or higher effectiveness and less entanglement.
  • Figure 5: Entanglement (lower is better) based on perspective being steered for Gemma-2-2B, Llama-3.1-8B, Qwen-2.5-1.5B, Qwen-2.5-3B, and Qwen-2.5-7B.
  • ...and 11 more figures