Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs

Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li

TL;DR

This study investigates the internal mechanisms by which online reinforcement-learning post-training enhances LLMs beyond supervised fine-tuning. Using Edge Attribution Patching on a graph representation of Transformer residuals, the authors compare pre- and post-RL models across four model families and mathematical benchmarks, finding that RL increases both activation intensity and diversity of activation patterns in internal pathways. Direct Preference Optimization (DPO) shows weaker or inconsistent internal changes, highlighting key differences between online RL and offline preference methods. The results support a unified view that RL reshapes information flow to be more redundant and flexible, contributing to improved mathematical generalization and providing practical guidance for post-training algorithm design. Limitations include domain specificity to math tasks, scale restrictions, and architectural focus on LLaMA-like models.

Abstract

Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the mechanisms by which RL fine-tuning enhances the capability of LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families and mathematical datasets shows two robust effects of online RL post-training: (i) an overall increase in average activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in mathematical generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://github.com/tsinghua-fib-lab/llm_rl_probing_analysis.
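To make the two reported effects concrete, the following minimal sketch computes one plausible proxy for each from a list of per-edge importance scores. The abstract does not give the paper's exact metric definitions, so the function names `activation_intensity` and `edge_entropy`, the use of absolute scores, and the toy score lists are all illustrative assumptions rather than the authors' implementation.

```python
import math

def activation_intensity(edge_scores):
    """Mean absolute edge score (assumed proxy for 'activation intensity')."""
    return sum(abs(s) for s in edge_scores) / len(edge_scores)

def edge_entropy(edge_scores):
    """Shannon entropy (in nats) of absolute edge scores normalized to sum to 1.

    Higher entropy means importance is spread over more edges,
    i.e. a less concentrated edge distribution.
    """
    total = sum(abs(s) for s in edge_scores)
    if total == 0:
        return 0.0
    probs = [abs(s) / total for s in edge_scores]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical edge scores: one dominated by a single edge, one more spread out.
concentrated = [0.97, 0.01, 0.01, 0.01]
spread = [0.4, 0.3, 0.2, 0.1]

print(edge_entropy(concentrated), edge_entropy(spread))
print(activation_intensity(concentrated), activation_intensity(spread))
```

Under this proxy, a model whose importance mass is distributed across many edges yields a higher entropy than one dominated by a single pathway, which is the qualitative pattern the abstract attributes to online RL.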

Paper Structure

This paper contains 29 sections, 16 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Schematic of a two-layer simplified LLM. (a) Residual perspective, (b) graph perspective, and (c) edge importance estimation: above the dashed line, ACDC-style methods measure the loss change after edge ablation ($\textcircled{2}-\textcircled{1}$), and below, EAP-style methods approximate this via backpropagated gradients ($-\textcircled{3} \approx \textcircled{2}-\textcircled{1}$).
  • Figure 2: Relative change in edge activation strength after RL fine-tuning for the Mistral model on the MATH dataset with $\alpha=0.5$.
  • Figure 3: Comparison before and after RL fine-tuning: (a) diversity of activation patterns across inference samples, including data from all datasets and $\alpha$ values; (b) entropy of output edge patterns per node. In (b), data points are arranged sequentially by dataset (College Math, GSM8K, MATH), iterating over $\alpha \in \{0.03, 0.1, 0.3, 0.5\}$ for each.
  • Figure 4: Smoothed moving-average curves of reward and group reward standard deviation during RL training under different temperatures, with a sliding window length of 8.
  • Figure 5: (a) Accuracy of models trained with different numbers of RL steps on the GSM8K test set. (b), (c), and (d) show, respectively, the differences in Activation Intensity, Distribution Stability, and Information Complexity between models trained with different numbers of RL steps and the initial model. We highlight the training intervals where the performance gains are most pronounced for temperature = 0.6 and temperature = 1.0, corresponding to [60, 120] and [40, 100], and mark the extrema of each metric within these intervals in the expected direction.
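The relationship in the Figure 1 caption, where a gradient-based estimate stands in for the loss change measured by explicit edge ablation, can be illustrated with a first-order Taylor approximation on a toy scalar "edge". This is only a sketch under stated assumptions: the quadratic `loss`, its `weight` and `target` parameters, and the clean and corrupted edge values are hypothetical, and the sign convention may differ from the one used in the figure.

```python
def loss(edge_value, weight=1.5, target=2.0):
    """Toy loss as a function of a single edge activation (hypothetical)."""
    return (weight * edge_value - target) ** 2

def loss_grad(edge_value, weight=1.5, target=2.0):
    """Analytic derivative of the toy loss with respect to the edge activation."""
    return 2.0 * weight * (weight * edge_value - target)

def ablation_effect(clean, corrupted):
    """ACDC-style measurement: rerun the loss with the edge value replaced."""
    return loss(corrupted) - loss(clean)

def attribution_estimate(clean, corrupted):
    """EAP-style estimate: activation difference times the gradient at the clean value."""
    return (corrupted - clean) * loss_grad(clean)

clean, corrupted = 1.0, 1.1
exact = ablation_effect(clean, corrupted)
approx = attribution_estimate(clean, corrupted)
print(exact, approx, abs(exact - approx))
```

The estimate needs only one gradient evaluation at the clean value, whereas the ablation measurement needs a separate loss evaluation for each edge; the price is an approximation error that grows with the size of the activation change.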