Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Yein Park, Minbyul Jeong, Jaewoo Kang

TL;DR

The paper investigates how post-training paradigms reshape the internal reasoning mechanics of large reasoning models, using mechanistic, circuit-level analysis to reveal emergent attention-head circuits. It shows that distillation and supervised fine-tuning (SFT) steadily add new, stable heads, often in middle-to-late layers, while group relative policy optimization (GRPO) drives a dynamic, reward-guided exploration and pruning of heads. The Think On/Off analysis shows that explicit reasoning gating does not create dedicated thinking heads; instead, disabling thinking triggers a broad set of compensatory heads, linking gating to efficient computation and robustness. Overall, the work highlights a tension between developing sophisticated, structured reasoning and maintaining reliable execution, and it advocates head-aware training policies that balance exploration, precision, and computational reliability for robust reasoning systems.

Abstract

The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning and reinforcement learning. However, the architectural mechanisms behind such improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across Qwen families and a DeepSeek-distilled model reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning triggers a broader (but less efficient) set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies for difficult problems but can also introduce over-thinking failure modes, such as calculation errors or logical loops on simpler tasks. These findings connect circuit-level dynamics to macro-level performance, identifying an inherent tension where complex reasoning comes at the cost of elementary computations. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.

Paper Structure

This paper contains 40 sections, 29 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Reasoning circuits trace the internal computations of LRMs at each checkpoint. After post-training, newly activated attention heads influence the performance at those checkpoints.
  • Figure 2: Analysis of Emergent Attention Heads in Qwen2.5-Math-1.5B during GRPO. (A) shows a cohort analysis of attention head activation across training checkpoints. The blue line tracks the absolute number of newly activated heads compared to the base model, while the red dashed line indicates the number of original heads that are maintained. The stacked areas represent cohorts of heads, color-coded by the checkpoint at which they first emerged, showing their persistence and evolution over time. The fluctuation in newly activated heads follows a trend similar to the accuracy reward curve in (B). (C) shows a heatmap detailing the changes in activation frequency. Red cells denote heads from the original base model, with fading intensity indicating their gradual deactivation. Blue cells represent newly emerged heads, with darker shades signifying higher activation frequency across checkpoints. Heads active in the final checkpoint are outlined with a black border.
  • Figure 3: Analysis of Emergent Attention Heads in Qwen2.5-Math-1.5B during SFT. (A) shows a cohort analysis of attention head activation across training checkpoints. The blue line tracks the absolute number of newly activated heads compared to the base model, while the red dashed line indicates the number of original heads that are maintained. The stacked areas represent cohorts of heads, color-coded by the checkpoint at which they first emerged, showing their persistence and evolution over time. (B) shows a heatmap detailing the changes in activation frequency. Red cells denote heads from the original base model, with fading intensity indicating their gradual deactivation. Blue cells represent newly emerged heads, with darker shades signifying higher activation frequency across checkpoints. Heads active in the final checkpoint are outlined with a black border.
  • Figure 4: Performance change across various benchmarks for each checkpoint of GRPO training with two different training datasets: GSM8K and OpenR1-Math-220k. The green and red arrows indicate notable performance gains and losses across checkpoints, and the captions summarize the qualitative analysis. The performance trade-off at each checkpoint is similarly reproduced when we apply attention head scaling with emergent reasoning heads to the baseline model. Actual examples are presented in the qualitative-analysis appendices.
  • Figure 5: Performance difference with increasing coverage. The left panel shows the pass@k difference as sampling coverage increases, while the right panel shows efficient correctness measured by success@k.
  • ...and 8 more figures
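
The cohort bookkeeping described in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. This is not the authors' code: it assumes the set of circuit-active heads at each checkpoint is already available (how the circuits are extracted is not covered here), and the function name `cohort_analysis`, the `(layer, head)` tuple encoding, and the checkpoint names are hypothetical. It also assumes that a head which disappears and later reappears stays in the cohort of its first emergence.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical encoding


def cohort_analysis(
    base_heads: Set[Head],
    checkpoint_heads: List[Tuple[str, Set[Head]]],
) -> List[dict]:
    """Summarise head activation per checkpoint relative to a base model.

    For each checkpoint, report:
      - new: heads active now that were not active in the base model
      - maintained: base-model heads that are still active
      - cohorts: the new heads, counted by the checkpoint where each
        head first emerged (assumption: first emergence fixes the cohort)
    """
    first_seen: Dict[Head, str] = {}
    rows = []
    for name, active in checkpoint_heads:
        new_heads = active - base_heads
        # Record the first checkpoint at which each new head appears.
        for head in new_heads:
            first_seen.setdefault(head, name)
        # Count currently active new heads by their cohort of origin.
        cohorts: Dict[str, int] = {}
        for head in new_heads:
            origin = first_seen[head]
            cohorts[origin] = cohorts.get(origin, 0) + 1
        rows.append(
            {
                "checkpoint": name,
                "new": len(new_heads),
                "maintained": len(active & base_heads),
                "cohorts": cohorts,
            }
        )
    return rows


# Toy data, invented purely for illustration.
base = {(0, 1), (3, 2), (5, 0)}
checkpoints = [
    ("step100", {(0, 1), (3, 2), (7, 4)}),
    ("step200", {(0, 1), (7, 4), (9, 1), (10, 3)}),
    ("step300", {(0, 1), (9, 1)}),
]
for row in cohort_analysis(base, checkpoints):
    print(row)
```

In this toy run, the head that emerged at `step100` is gone by `step300` while one `step200` head survives, which mirrors the kind of activation and pruning pattern the GRPO caption describes. Under SFT the captions suggest the cohorts would instead mostly accumulate.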