Towards Inference-time Category-wise Safety Steering for Large Language Models

Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien

TL;DR

This work tackles the fragility of safety alignment in large language models by proposing a training-free, inference-time safety steering method that operates through category-specific steering vectors derived from model activations. It introduces a two-step framework: (1) compute category-dependent vectors using unsupervised or guided signals, and (2) intervene on attention activations during inference to steer outputs toward safety, with optional pruning to enhance signal quality. The approach is evaluated on the CatQA, BeaverTails, and Alpaca Instructions datasets with multiple LLMs, demonstrating reductions in unsafe outputs with varying effects on helpfulness and coherence, and highlighting that generic harmless data can sometimes outperform category-specific data. The study provides practical guidance on vector extraction, layer selection, and the trade-offs between safety and text quality, and it outlines avenues for extending activation types and optimizing steering under quality constraints. Overall, the method offers a modular, plug-and-play safety mechanism that can adapt to dynamic policy updates without retraining the model.
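As a rough illustration of the two-step framework summarized above, the following minimal sketch computes one steering vector per harm category and then applies it to a single activation. It is not the authors' implementation. It assumes a difference-of-means vector (mean unsafe activation minus mean safe activation) and a simple subtraction at inference time. The function names, the `multiplier` parameter, the category key, and the toy 3-dimensional activations are all hypothetical, and the optional pruning step is omitted.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(unsafe_acts, safe_acts):
    """Step 1 (assumed form): mean unsafe activation minus mean safe activation."""
    mu_unsafe = mean_vector(unsafe_acts)
    mu_safe = mean_vector(safe_acts)
    return [u - s for u, s in zip(mu_unsafe, mu_safe)]


def steer(activation, vector, multiplier=1.0):
    """Step 2 (assumed form): shift an activation away from the unsafe direction."""
    return [a - multiplier * v for a, v in zip(activation, vector)]


# Toy activations for one hypothetical harm category (not real model data).
unsafe = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
safe = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

# One vector per category lets each category be steered independently.
vectors = {"hypothetical_category": steering_vector(unsafe, safe)}
steered = steer([2.0, 2.0, 1.0], vectors["hypothetical_category"], multiplier=0.5)
```

In a real model the same arithmetic would be applied to activations captured at chosen layers; which layers and what multiplier to use are exactly the choices the paper's guidance on layer selection and safety versus quality trade-offs addresses.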

Abstract

While large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases, safety alignment of these models is still an area of active research. The fragile nature of LLMs, even models that have undergone extensive alignment and safety training regimes, warrants additional safety steering steps via training-free, inference-time methods. While recent work in the area of mechanistic interpretability has investigated how activations in latent representation spaces may encode concepts, and thereafter performed representation engineering to induce such concepts in LLM outputs, the applicability of such methods to safety is relatively under-explored. Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using: (i) category-specific steering vectors, thereby enabling fine-grained control over the steering, and (ii) sophisticated methods for extracting informative steering vectors for more effective safety steering while retaining quality of the generated text. We demonstrate our exploration on multiple LLMs and datasets, and showcase the effectiveness of the proposed steering method, along with a discussion on the implications and best practices.

Paper Structure (28 sections, 2 equations, 2 figures, 8 tables, 2 algorithms)

Figures (2)

  • Figure 1: The proposed category-specific steering method, where $c^i$ refers to a specific harm category.
  • Figure 2: Steering performance on the CatQA dataset, compared across naive generation, steering with all activations, and steering with pruned activations, for Llama2-7B Instruct (top row) and Llama3-8B (bottom row). %UR is shown on a 0-1 scale and should be low ($\downarrow$), while 'Helpfulness' and 'Coherence' should be high ($\uparrow$).