Interpretable LLM Guardrails via Sparse Representation Steering

Zeqing He, Zhibo Wang, Huiyu Xu, Hejun Lin, Wenhui Zhang, Zhixuan Chu

TL;DR

This work addresses the need for interpretable, fine-grained guardrails for LLMs without retraining by introducing Sparse Representation Steering (SRS). SRS maps dense activations to a sparse, monosemantic space via a pretrained Sparse Autoencoder and identifies attribute-relevant features by contrasting positive vs. negative prompts and using bidirectional KL divergence. By injecting a scaled steering vector at inference, SRS achieves strong single- and multi-attribute control across safety, fairness, and truthfulness, while preserving linguistic quality and robustness to jailbreaks; PCA-based composition often yields the best multi-attribute performance. The approach offers modular, interpretable, and scalable guardrails with demonstrated effectiveness on Gemma-2 models and opens avenues for context-aware and hierarchical steering in future work.

Abstract

Large language models (LLMs) exhibit impressive capabilities in generation tasks but are prone to producing harmful, misleading, or biased content, posing significant ethical and safety concerns. To mitigate such risks, representation engineering, which steers model behavior toward desired attributes by injecting carefully designed steering vectors into an LLM's representations at inference time, has emerged as a promising alternative to fine-tuning approaches. However, because LLM representations are semantically entangled, existing representation engineering methods still suffer from several limitations: limited fine-grained controllability, content quality degradation, and conflicts in multi-attribute control. To overcome these challenges, we propose Sparse Representation Steering (SRS), a novel framework that achieves fine-grained and interpretable control over LLM behavior by first disentangling internal activations into a sparse, semantically meaningful representation space, and then selectively steering relevant dimensions. Specifically, SRS leverages a pretrained Sparse Autoencoder (SAE) to transform dense, entangled activation patterns into a sparse monosemantic feature space. To identify relevant features, SRS contrasts sparse activations from positive and negative prompt pairs and measures their bidirectional KL divergence to locate the dimensions most associated with the target attribute. We conduct comprehensive experiments on Gemma-2 series models across three alignment dimensions, i.e., safety, fairness, and truthfulness, to evaluate the effectiveness of SRS. Results show that SRS consistently outperforms existing steering methods, achieving significantly improved controllability in both single- and multi-attribute settings while preserving high linguistic quality and general ability.
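
As a rough illustration of the feature-identification step described in the abstract, the sketch below scores each sparse dimension by its contribution to the KL divergence in both directions between averaged positive and negative SAE activations. The normalization of mean activations into distributions, the smoothing constant `eps`, the top-k selection, the toy activation values, and all function names are assumptions made for illustration; the abstract does not specify how the paper computes or thresholds these scores.

```python
import math


def normalize(acts, eps=1e-8):
    """Turn non-negative mean activations into a probability distribution.

    `eps` is an assumed smoothing constant that keeps every entry positive
    so the logarithms below are defined.
    """
    smoothed = [a + eps for a in acts]
    total = sum(smoothed)
    return [a / total for a in smoothed]


def bidirectional_kl_scores(pos_mean, neg_mean, eps=1e-8):
    """Per-dimension contribution to KL(p || q) + KL(q || p).

    p and q are the normalized mean sparse activations for the positive
    and negative prompt sets (an assumed construction).
    """
    p = normalize(pos_mean, eps)
    q = normalize(neg_mean, eps)
    scores = []
    for p_i, q_i in zip(p, q):
        forward = p_i * math.log(p_i / q_i)   # this dimension's term of KL(p || q)
        backward = q_i * math.log(q_i / p_i)  # this dimension's term of KL(q || p)
        scores.append(forward + backward)
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring sparse dimensions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy mean activations over four hypothetical SAE features.
pos_mean = [0.0, 2.0, 0.1, 0.1]  # averaged over positive prompts
neg_mean = [0.0, 0.1, 0.1, 2.0]  # averaged over negative prompts

scores = bidirectional_kl_scores(pos_mean, neg_mean)
selected = top_k_features(scores, k=2)
print(scores, selected)
```

In this toy setup, dimensions that fire equally under both prompt sets score near zero, while dimensions whose activation mass shifts between the sets score high. The sign of the difference between the two normalized activations would then indicate whether a selected dimension is associated with the positive or the negative prompts.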

Paper Structure

This paper contains 30 sections, 14 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the proposed SRS pipeline, which consists of two key stages: (1) Steering Vector Generation. Task-specific sparse features are identified by comparing the sparse features of positive and negative prompt pairs encoded with a pretrained sparse autoencoder. (2) Model Inference Under the Guidance of Steering Vector. The learned sparse steering vector is applied to modulate the model's activations at a specific layer, enhancing relevant feature dimensions while suppressing undesired ones, thereby achieving fine-grained and interpretable control over LLM outputs.
  • Figure 2: Responses of Gemma-2-2B-it to a malicious prompt ("Write a script that steals sensitive information from a company's database") under three settings, i.e., no control, steering with CAA, and steering with our proposed SRS. Without control, the model generates harmful code with only a superficial disclaimer. With CAA, the model partially refuses the request but still emits unsafe code fragments. In contrast, SRS performs a complete behavioral override: it immediately rejects the request and returns a structured, human-readable warning that explains the legal, ethical, and security risks.
  • Figure 3: Comparison of defense rates against various jailbreak attacks across different steering strategies. Higher defense rates reflect stronger steering effectiveness.
  • Figure 4: Effectiveness of different steering methods across three domains (i.e., safety, fairness, and truthfulness), with interventions applied to individual transformer layers to analyze layer-wise steering performance.
  • Figure 5: Scores of the positive and negative impacts of each sparse representation dimension in the safety domain.
  • ...and 3 more figures
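
Stage (2) in the Figure 1 caption, applying a sparse steering vector to a layer's activations, can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the paper's implementation: the steering vector is assumed to hold +1 for dimensions to enhance and -1 for dimensions to suppress, it is assumed to be mapped back to the dense space through the SAE decoder and added to the hidden state with a scalar strength `alpha`, and the 4x3 decoder matrix, the feature indices, and all function names are hypothetical.

```python
def build_sparse_steering(num_features, enhance, suppress):
    """Sparse steering vector: +1 on features to enhance, -1 on features to suppress."""
    steering = [0.0] * num_features
    for i in enhance:
        steering[i] = 1.0
    for i in suppress:
        steering[i] = -1.0
    return steering


def decode(sparse_vec, decoder):
    """Map a sparse feature vector to the dense space: sum_i s_i * decoder[i]."""
    dim = len(decoder[0])
    dense = [0.0] * dim
    for s_i, row in zip(sparse_vec, decoder):
        if s_i != 0.0:  # inactive features contribute nothing
            for j in range(dim):
                dense[j] += s_i * row[j]
    return dense


def steer(hidden, sparse_steering, decoder, alpha):
    """Add the decoded steering direction, scaled by alpha, to one hidden state."""
    delta = decode(sparse_steering, decoder)
    return [h + alpha * d for h, d in zip(hidden, delta)]


# Toy SAE decoder: 4 sparse features, 3 dense dimensions (hypothetical weights).
decoder = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

sparse_steering = build_sparse_steering(4, enhance=[0], suppress=[2])
hidden = [0.5, 0.5, 0.5]  # stand-in for a residual-stream activation
steered = steer(hidden, sparse_steering, decoder, alpha=2.0)
print(steered)
```

In a real model, the same addition would typically be applied to the chosen layer's output at every token position during generation, for example through a forward hook. Setting `alpha` to zero leaves the activation unchanged, which gives a simple check that the intervention is isolated to the steering term.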