MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu

TL;DR

MSRS tackles multi-attribute steering in LLMs by learning disentangled representations across multiple subspaces. It builds orthogonal attribute-specific subspaces plus a shared subspace via SVD on attribute activations: the shared basis is $B_{shared} = V_{c,1:r_s}^\top$, with $r_s$ chosen so that the cumulative energy in $\Sigma_c$ is at least $60\%$, and the private bases $B_1, \dots, B_n$ are derived from residuals. Together they form $S_{align} = [B_{shared}, B_1, \dots, B_n]$, which guides alignment. A mask-based adaptive mechanism weights subspace contributions via $m(h) = \text{sigmoid}(\text{MLP}(h))$ inside the steering function $\Phi_{l,p}(h; R, W, b, m) = h + R^\top \text{diag}(m(h))(W h + b - R h)$. Training minimizes $\mathcal{L} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{reg} + \lambda_2 \mathcal{L}_{align}$, where the task loss $\mathcal{L}_{task}$ is combined with the regularizers $\mathcal{L}_{reg}$ and $\mathcal{L}_{align}$ to promote disentanglement and cross-attribute integration. At inference, dynamic token selection scores each token $t$ for attribute $i$ with $s_{i,t} = \| \text{proj}_{R_i}(h_t) \|_2$, picks the most relevant position $p_i$, and applies $\Phi_{l,p}(\cdot)$ there. Empirically, MSRS improves multi-attribute trade-offs on TruthfulQA, BBQ, Alpaca, Refusal, HelpSteer, and standard NLP benchmarks (GLUE, MMLU) across multiple models while preserving general capabilities, demonstrating robust, scalable, and interpretable control over aligned language generation.
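The two formulas above can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `shared_basis` and `steer` are hypothetical, "cumulative energy" is assumed to mean the cumulative sum of squared singular values (the summary does not specify), $R$ is assumed to have orthonormal rows, and the mask is passed in as a precomputed vector rather than produced by a trained MLP.

```python
import numpy as np

def shared_basis(acts, energy=0.60):
    """Return the top right-singular vectors (as rows) of `acts`.

    The rank r_s is the smallest one whose cumulative energy reaches
    `energy`. Assumption: energy = squared singular values.
    """
    _, s, vh = np.linalg.svd(acts, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First index where cum >= energy, converted to a rank and clamped.
    r_s = min(int(np.searchsorted(cum, energy)) + 1, len(s))
    return vh[:r_s]  # plays the role of B_shared = V_{c,1:r_s}^T

def steer(h, R, W, b, m):
    """Phi(h) = h + R^T diag(m) (W h + b - R h).

    h: (d,) hidden state; R, W: (r, d); b, m: (r,).
    m stands in for m(h) = sigmoid(MLP(h)) and lies in [0, 1].
    """
    return h + R.T @ (m * (W @ h + b - R @ h))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r = 8, 3
    # Orthonormal rows for R via a QR decomposition (illustrative only).
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    R = q.T
    W = rng.standard_normal((r, d))
    b = rng.standard_normal(r)
    h = rng.standard_normal(d)
    print(steer(h, R, W, b, np.ones(r)).shape)
    print(shared_basis(np.diag([3.0, 2.0, 1.0])).shape)
```

With orthonormal rows in $R$ and a fully open mask ($m = 1$), the coordinates of the steered state inside the subspace become exactly $W h + b$, while the component of $h$ orthogonal to the subspace is left untouched; a fully closed mask ($m = 0$) returns $h$ unchanged.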

Abstract

Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model's representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.
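The token-level mechanism can be illustrated with the score $s_{i,t} = \| \text{proj}_{R_i}(h_t) \|_2$ given in the TL;DR. The sketch below is a hedged reading of that step, not the paper's code: the name `select_tokens` is hypothetical, each $R_i$ is assumed to have orthonormal rows (so the projection norm equals $\| R_i h_t \|_2$), and "most relevant" is assumed to mean the token with the highest score.

```python
import numpy as np

def select_tokens(H, subspaces):
    """Pick one token position per attribute.

    H: (T, d) hidden states for T tokens.
    subspaces: list of (r_i, d) bases R_i with orthonormal rows.
    Returns a list of positions p_i = argmax_t ||R_i h_t||_2.
    """
    positions = []
    for R in subspaces:
        scores = np.linalg.norm(H @ R.T, axis=1)  # s_{i,t} for every t
        positions.append(int(np.argmax(scores)))
    return positions

if __name__ == "__main__":
    H = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])
    R1 = np.array([[1.0, 0.0, 0.0]])
    R2 = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
    print(select_tokens(H, [R1, R2]))
```

Steering would then be applied only at the returned positions, one per attribute, rather than at every token.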

Paper Structure

This paper contains 30 sections, 5 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Visualization of MSRS design and comparison with prior work.
  • Figure 2: Comparison of model performance across GLUE.
  • Figure 3: Relationship between performance and shared rank ratio, alongside the explained standard deviation.
  • Figure 4: Token–$R$ similarity vs. performance under different token interventions. Higher similarity correlates with better performance.
  • Figure 5: Performance of interventions at different transformer layers. Mid-layer intervention consistently outperforms others.