
Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
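
The sparsity claim above can be made concrete with a small sketch. It assumes we already have next-token probability vectors from the base and RL policies at the same positions of one response. The toy arrays, the helper name `js_divergence`, and the cutoff `threshold = 0.1` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between matching rows of p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # KL(a || b) per row, treating 0 * log(0) as 0.
        return np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions: 5 positions, vocabulary of 4 (made-up numbers).
base = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.80, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.90, 0.05, 0.03, 0.02],
])
rl = base.copy()
rl[1] = [0.85, 0.05, 0.05, 0.05]  # the RL policy differs at a single position

js = js_divergence(base, rl)      # one divergence value per position
threshold = 0.1                   # hypothetical cutoff for "meaningful" divergence
shifted = js > threshold
fraction_shifted = shifted.mean()

print("per-position JS:", np.round(js, 4))
print("fraction of positions above threshold:", fraction_shifted)
```

In this toy case only one of five positions crosses the cutoff, mirroring the qualitative picture of divergence that stays near zero at most positions.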

Paper Structure

This paper contains 65 sections, 4 theorems, 52 equations, 57 figures, 4 tables, 1 algorithm.

Key Result

Lemma A.1

Let $P,Q$ be probability distributions on $X_{1:T_{\max}}$ admitting factorizations $P(x_{1:T_{\max}})=\prod_{t=1}^{T_{\max}} p_t(x_t\mid x_{<t})$ and $Q(x_{1:T_{\max}})=\prod_{t=1}^{T_{\max}} q_t(x_t\mid x_{<t})$. For each $t$, let $P_{<t}$ and $Q_{<t}$ denote the marginals of $X_{<t}$ under $P$ an

Figures (57)

  • Figure 1: Overview: RLVR acts as sparse, high-impact token-level refinement. RL fine-tuning induces sparse distributional shifts: divergence between base and RL token distributions remains near zero at most positions, with only a small subset of tokens exhibiting substantial changes.
  • Figure 2: JS divergence distributions for Qwen2.5 32B DAPO and SimpleRL on AIME 2024.
  • Figure 3: Mean and median JS divergence by normalized token position, with percentile bands. Both methods concentrate updates at the start and, to a lesser degree, at the end of responses.
  • Figure 4: Entropy distributions for low and high divergence distributions for DAPO. Low-divergence tokens are generally low-entropy, while high-divergence tokens span both high- and low-entropy regions, indicating that DAPO can modify even initially confident predictions.
  • Figure 5: Word clouds of high and low divergence tokens under DAPO.
  • ...and 52 more figures
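
The positional and entropy analyses described in the captions for Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: per-token JS divergence values are taken as given, positions are normalized as `(t + 0.5) / T`, and the helper names, the bin count, and all numbers are hypothetical rather than the paper's procedure or results.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    """Shannon entropy (natural log) of each row of next-token probabilities."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(np.where(probs > 0, probs * np.log(probs + eps), 0.0), axis=-1)

def positional_profile(js_per_response, n_bins=4):
    """Mean and median divergence per normalized-position bin.

    Assumes every bin receives at least one token.
    """
    positions, values = [], []
    for js in js_per_response:
        js = np.asarray(js, dtype=float)
        T = len(js)
        positions.append((np.arange(T) + 0.5) / T)  # normalized position in (0, 1)
        values.append(js)
    positions = np.concatenate(positions)
    values = np.concatenate(values)
    bin_ids = np.minimum((positions * n_bins).astype(int), n_bins - 1)
    mean = np.array([values[bin_ids == b].mean() for b in range(n_bins)])
    median = np.array([np.median(values[bin_ids == b]) for b in range(n_bins)])
    return mean, median

# Made-up per-token divergences for two responses of different lengths.
responses = [
    [0.4, 0.0, 0.0, 0.1],
    [0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
]
mean_js, median_js = positional_profile(responses, n_bins=4)
print("mean JS per bin:  ", np.round(mean_js, 4))
print("median JS per bin:", np.round(median_js, 4))

# Entropy of a uniform and a fully confident distribution over 4 tokens.
entropies = token_entropy([[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]])
print("entropies:", np.round(entropies, 4))
```

Normalizing by response length lets responses of different lengths share the same bins, which is what makes a start-versus-end comparison meaningful.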

Theorems & Definitions (11)

  • Lemma A.1: KL decomposition
  • proof
  • Proposition A.2: Token-level KL threshold $\Rightarrow$ sequence-level KL bound
  • proof
  • Remark A.3: Effective KL on non-intervention steps
  • Definition A.4: Skew Jensen–Shannon divergence
  • Lemma A.5: JS decomposition via skew JS
  • proof
  • Proposition A.6: Token-level skew-JS control $\Rightarrow$ sequence-level JS bound
  • proof
  • ...and 1 more
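
The listing above names a KL decomposition (Lemma A.1) and a token-level-to-sequence-level bound (Proposition A.2), but their full statements are not reproduced here. As a hedged illustration, the sketch below numerically checks the standard chain rule for KL divergence on a two-step autoregressive toy model, together with the elementary consequence that the sequence-level KL is at most the number of steps times the largest per-token KL. All distributions are made-up, and this is a sanity check of a textbook identity, not the paper's proof.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; assumes strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two-step toy model over a binary vocabulary.
# p1[x1] is the first-token distribution; p2[x1, x2] = p(x2 | x1).
p1 = np.array([0.6, 0.4])
p2 = np.array([[0.7, 0.3],
               [0.2, 0.8]])
q1 = np.array([0.5, 0.5])
q2 = np.array([[0.6, 0.4],
               [0.5, 0.5]])

# Sequence-level KL computed directly from the joint distributions.
joint_p = p1[:, None] * p2
joint_q = q1[:, None] * q2
sequence_kl = kl(joint_p.ravel(), joint_q.ravel())

# Token-level sum: KL at step 1 plus the expected KL at step 2,
# with the expectation taken over prefixes drawn from P.
step1 = kl(p1, q1)
step2_per_prefix = np.array([kl(p2[x], q2[x]) for x in range(2)])
step2 = float(np.dot(p1, step2_per_prefix))
token_level_sum = step1 + step2

# Largest per-token KL over all steps and prefixes.
max_token_kl = max(step1, float(step2_per_prefix.max()))

print("sequence-level KL:", sequence_kl)
print("token-level sum:  ", token_level_sum)
print("2 * max token KL: ", 2 * max_token_kl)
```

The two printed KL values agree up to floating-point error, and both sit below the crude bound in the last line.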