Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin

TL;DR

CDAS introduces a weakly supervised, distribution-matching steering paradigm that uses distributed interchange interventions to identify and manipulate internal concept features in LLMs. By replacing direct probability maximization with a Jensen-Shannon-based distribution-matching objective and using counterfactual inputs to derive steering factors, CDAS achieves bi-directional control with improved faithfulness and stability. On AxBench, CDAS does not always outperform Lang./PO methods but appears to benefit more from increased model scale; in the safety-focused case studies, it preserves general utility while overriding unwanted refusals or neutralizing backdoor activations. The work positions mechanistic-interpretability-guided intervention as a complement to preference optimization for reliable model steering, while noting data demands and tuning considerations for practical deployment.

Abstract

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weakly supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering. Our code is available at https://github.com/colored-dye/concept_das.
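
To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the commonly used DAS formulation of a distributed interchange intervention, in which a matrix `R` with orthonormal rows spans the hypothesized concept subspace, and it uses the textbook Jensen-Shannon divergence as a stand-in for the distribution-matching objective. The names `dii`, `js_divergence`, `R`, `h_base`, and `h_source` are hypothetical, and the toy vectors replace real residual-stream activations and model output distributions.

```python
import numpy as np


def dii(h_base, h_source, R):
    """Toy distributed interchange intervention on one hidden vector.

    Assumption: R is (k x d) with orthonormal rows. The result keeps the
    base vector outside the subspace spanned by R and takes the source
    vector's coordinates inside it: h_base + R^T (R h_source - R h_base).
    """
    return h_base + R.T @ (R @ h_source - R @ h_base)


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical sizes: hidden width d, concept-subspace rank k.
rng = np.random.default_rng(0)
d, k = 8, 2

# Build orthonormal rows from a reduced QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
R = Q.T

h_base = rng.normal(size=d)    # stands in for the base input's activation
h_source = rng.normal(size=d)  # stands in for the counterfactual source's activation
h_new = dii(h_base, h_source, R)

# In training, one would compare the intervened model's next-token
# distribution with the counterfactual distribution; toy logits here.
p_intervened = softmax(np.array([2.0, 0.5, -1.0]))
p_counterfactual = softmax(np.array([-1.0, 0.5, 2.0]))
loss = js_divergence(p_intervened, p_counterfactual)
```

Under these assumptions, the intervened vector agrees with the source inside the subspace and with the base everywhere else. Swapping which input plays the base role and which plays the source role would move the concept coordinates in the opposite direction, which is one way to read the bi-directional steering described above.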

Paper Structure

This paper contains 56 sections, 17 equations, 26 figures, and 39 tables.

Figures (26)

  • Figure 1: ASR (%) results. Standard error is shown as shaded regions.
  • Figure 2: Illustration of distributed interchange intervention (DII) when overriding refusal to a harmful instruction (${\mathbf{x}}^c$) as base input using a benign instruction (${\mathbf{x}}$) as the counterfactual source input, in order to obtain compliant response ${\mathbf{y}}$. Blue and orange squares denote the residual stream (${\mathbf{h}}_l^{(t)} \in \mathbb{R}^d$, where $t$ is token index and $l$ is layer index) of benign and harmful instructions, respectively, while arrows denote information flow. For brevity, we only use a single arrow to indicate the causal attention mechanism, where ${\mathbf{h}}_l^{(t)}$ receives information from ${\mathbf{h}}_{l-1}^{(j)}$ ($j \leq t$).
  • Figure 3: Ablation of CDAS training objective: Steering factor vs. scores.
  • Figure 4: Template for concept-eliciting instructions.
  • Figure 5: Prompt for concept-neutral responses.
  • ...and 21 more figures

Theorems & Definitions (2)

  • Remark: CDAS is not causal variable localization
  • Remark