Steering at the Source: Style Modulation Heads for Robust Persona Control

Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

Abstract

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While it controls target traits (e.g., persona) effectively, the accompanying coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of only three attention heads that independently govern persona and style formation, which we term Style Modulation Heads. These heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervening only on these heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and finer-grained model control.
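The localization procedure named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the array shapes, and in particular the scoring rule (projecting each head's output onto the direction of the summed attention output) are assumptions made for illustration, and the inputs are synthetic arrays rather than real model activations.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero for all-zero vectors


def layerwise_cosine(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between per-layer direction vectors.

    vectors: (num_layers, d_model), one direction per layer (hypothetical input).
    Returns a (num_layers, num_layers) similarity matrix.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / (norms + EPS)
    return unit @ unit.T


def head_contributions(head_outputs: np.ndarray) -> np.ndarray:
    """Score each head by its projection onto the aggregated output direction.

    head_outputs: (num_heads, d_model), each head's additive share of one
    layer's attention output. Assumed scoring rule, not the paper's definition.
    """
    total = head_outputs.sum(axis=0)
    direction = total / (np.linalg.norm(total) + EPS)
    return head_outputs @ direction


def top_heads(scores: np.ndarray, k: int = 3) -> list[int]:
    """Indices of the k highest-scoring heads, best first."""
    return np.argsort(scores)[::-1][:k].tolist()


# Synthetic example: heads 0 and 3 dominate the aggregated direction.
toy_heads = np.array([
    [5.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [4.0, 0.0, 0.0],
])
toy_scores = head_contributions(toy_heads)
toy_top = top_heads(toy_scores, k=2)

# Synthetic per-layer directions: layers 0 and 2 agree, layer 1 is orthogonal.
toy_layers = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
toy_sim = layerwise_cosine(toy_layers)
```

In a real setting the per-layer directions and per-head outputs would come from hooks on the model's forward pass; the sketch only shows how the two geometric quantities could be combined to shortlist candidate heads.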

Paper Structure (41 sections, 5 equations, 18 figures, 13 tables)

This paper contains 41 sections, 5 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: (A) Activation steering with a strong coefficient or out-of-distribution directions leads to rapid coherency degradation. (B) We identify a small subset of heads in a specific attention layer that governs persona generation, which we term Style Modulation Heads. (C) We compare the coherency degradation under residual stream intervention with that under targeted intervention on Style Modulation Heads.
  • Figure 2: Generated text quality and general utility metrics under activation steering in Qwen2.5-7B. Arrows ($\uparrow$, $\downarrow$) indicate the preferred direction (better performance). (a),(b),(e),(f) Trait and coherency scores evaluated by GPT-4.1-mini on the generated text. (c),(g) Perplexity scores show degradation that does not fully correspond to the asymmetry observed in coherency degradation. (d) MMLU scores exhibit minimal changes in both directions. (h) The IFEval instruction score fails to detect early breakdowns in coherency.
  • Figure 3: Persona Vector layer-wise cosine similarity heatmap (evil persona) and layer output steering in Qwen2.5-7B. In the heatmaps, the x-axis runs from shallow (left) to deep (right) layers. (a) The vector direction in the residual stream changes rapidly once the layer 20 attention output is added and remains constant thereafter. (b) Adjacent attention and MLP layers are oriented in opposite directions, but become orthogonal beyond layer 20. (c) Intervening at the layer 20 attention output significantly amplifies persona expression. (d) Intervening at the MLP output has minimal effect on trait expression.
  • Figure 4: Attention head contributions to aggregated attention output and steering effects in layer 20 of Qwen2.5-7B. (a) Specific heads (heads 3, 5, 28) exhibit significantly higher contribution scores, indicating their prominent role in persona generation. (b) Steering these heads leads to substantial increases in trait scores.
  • Figure 5: Sequential zero ablation results for high-contribution heads in Qwen2.5-7B. The x-axis shows the layer and head numbers added to the ablation set. (a) Removing high-contribution heads (left three layers) causes a rapid drop in trait score, whereas ablating other layers (right two) has minimal effect. (b) Coherency and MMLU scores remain stable, and the IFEval score does not drop when ablating layers 20 and 15.
  • ...and 13 more figures

Theorems & Definitions (1)

  • Definition 5.1