The Path Not Taken: RLVR Provably Learns Off the Principals

Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai

TL;DR

RLVR improves reasoning through verifiable rewards while making small, predominantly off-principal updates that respect the pretrained model's geometry. The authors formalize a Three-Gate Theory (KL Anchor, Model Geometry, and Precision) that explains why updates land in restricted regions and preserve spectral structure, in contrast to SFT, which targets principal directions and distorts the spectrum. By combining bf16-aware sparsity probes, layer-wide update maps, and geometry-manipulation interventions, they demonstrate a model-conditioned optimization bias that persists across tasks and models and also appears under RLHF. The findings argue for geometry-aware, RL-native parameter-efficient methods rather than repurposed SFT-era heuristics, with practical implications for LoRA and PiSSA in RL settings.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
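The two spectral diagnostics named in the abstract, principal-subspace rotation and spectral drift, can be sketched numerically. The snippet below is a minimal illustration rather than the authors' code. The function names are hypothetical, and two choices are assumptions: using left singular subspaces, and measuring drift as the relative L2 change in singular values.

```python
import numpy as np

def top_k_left_subspace(W, k):
    """Orthonormal basis (as columns) of the top-k left singular subspace of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def principal_angles(W0, W1, k):
    """Principal angles (radians, ascending) between the top-k left subspaces."""
    U0 = top_k_left_subspace(W0, k)
    U1 = top_k_left_subspace(W1, k)
    # Singular values of U0^T U1 are the cosines of the principal angles.
    cosines = np.linalg.svd(U0.T @ U1, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))

def normalized_spectral_drift(W0, W1):
    """Relative L2 change of the singular values (assumed normalization)."""
    s0 = np.linalg.svd(W0, compute_uv=False)
    s1 = np.linalg.svd(W1, compute_uv=False)
    return np.linalg.norm(s1 - s0) / np.linalg.norm(s0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W0 = rng.standard_normal((64, 48))
    W1 = W0 + 1e-3 * rng.standard_normal(W0.shape)  # small synthetic update
    print("max principal angle:", principal_angles(W0, W1, k=8).max())
    print("spectral drift:", normalized_spectral_drift(W0, W1))
```

Under these assumptions, a checkpoint pair with a small maximum angle and small drift would correspond to the "spectrum preserved" behavior attributed to RLVR, while larger values would correspond to the behavior attributed to SFT.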

Paper Structure

This paper contains 59 sections, 24 theorems, 44 equations, 18 figures, 5 tables.

Key Result

Proposition 3.1

Let $q(\cdot\mid x)$ be a full-support reference and let $\tilde{q}_\beta(\cdot\mid x)\propto q(\cdot\mid x)\exp(R/\beta)$ denote the soft-regularized improvement oracle. Let $\theta^+$ be the parametric fit obtained by the $M$-projection of $\tilde{q}_\beta$ onto the policy class, $\theta^{+}\in\arg\ldots$ [...] where the $o(1)$ term vanishes as $D_{\mathrm{KL}}(\tilde{q}_\beta\|\pi_\theta)\to 0$.
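As a small illustration of the oracle $\tilde{q}_\beta \propto q\,\exp(R/\beta)$ defined in the proposition, the sketch below tilts a toy reference distribution over three candidate responses and reports how far the result moves from $q$ in KL. It covers only the definition of $\tilde{q}_\beta$, not the bound itself. The toy distribution, the rewards, and the function names are all assumptions made for this example.

```python
import math

def tilted_distribution(q, rewards, beta):
    """Return q_tilde(y) proportional to q(y) * exp(R(y) / beta) on a finite set."""
    logits = [math.log(p) + r / beta for p, r in zip(q, rewards)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def kl(p, q):
    """KL(p || q) for finite distributions with full-support q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    q = [0.5, 0.3, 0.2]        # hypothetical reference policy
    rewards = [0.0, 1.0, 0.0]  # hypothetical verifiable reward (1 = correct)
    for beta in (0.5, 2.0, 10.0):
        q_tilde = tilted_distribution(q, rewards, beta)
        print(f"beta={beta}: q_tilde={q_tilde}, KL(q_tilde||q)={kl(q_tilde, q)}")
```

A larger $\beta$ keeps $\tilde{q}_\beta$ closer to the reference, which is the sense in which the regularization strength limits how far a single improvement step can move.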

Figures (18)

  • Figure 1: SFT vs. RLVR: optimization geometry and evidence. (a) SFT follows an externally guided route and traverses high-curvature directions ("over the mountain") to reach the target. (b) RLVR, without an explicit teacher, behaves as if steered by an implicit compass (a model-conditioned optimization bias), taking a low-curvature detour. (c) Evidence. Left: positional maps comparing the update mask (non-zero parameter updates) with the principal mask (positions aligned with top-$k$ singular subspaces, defined by the largest-magnitude entries of the rank-$k$ SVD reconstruction [liu2025lift]; details in Sec. \ref{sec:princ}). RLVR updates avoid principal-weight positions, whereas SFT targets them [meng2024pissa, liu2025lift]. Right: principal-angle curves of the top-$k$ subspaces show that RLVR rotates less (spectrum preserved), while SFT rotates more.
  • Figure 2: Consensus ratio of weight updates. Across five RLVR runs, we plot the 13th layer’s projections (Q/K/V/O) and the MLP down projection. Lighter bands mark coordinates updated in most runs, revealing a stable, stripe-like routing pattern rather than random scatter (zoom in to see fine structure).
  • Figure 3: Temporal emergence of the optimization bias, shown as row-wise and column-wise update ratios for the 13th attention block across gradient update steps ($t\in\{240,720,1200\}$), smoothed with a 3-step window. The row-dominant (Q) and column-dominant (O) patterns are consistent with the bias structures in Fig. \ref{fig:strips}. Grey dashed lines mark the head boundaries. The bias appears not only across heads but also within heads.
  • Figure 4: Spectral geometry under SFT vs. RLVR on Qwen3-8B [su2025klear]. Left: for an exemplar layer, top-$k$ principal angles and singular-value curves. Right: across all layers, maximum principal angle and normalized spectral drift. RLVR maintains a stable top-$k$ spectrum with minimal subspace rotation, unlike SFT. See DS-Qwen-1.5B in Fig. \ref{fig:rl-spec-1.5b} and Qwen3-14B-Base in Fig. \ref{fig:rl-spec-14b}.
  • Figure 5: RL avoids updating principal weights. We compare the RL update mask with the principal-weight mask $M_{\mathrm{princ}}$, the low-magnitude mask $M_{\mathrm{low}}$, and their combination $M_{\mathrm{princ}} \cap M_{\mathrm{low}}^c$. The layer-wise overlap between RL updates and principal weights is consistently sub-random, and the effect is more pronounced once the weights that overlap with $M_{\mathrm{low}}$ are removed, i.e., for $M_{\mathrm{princ}} \cap M_{\mathrm{low}}^c$.
  • ...and 13 more figures
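
The mask comparison described in the Figure 1 and Figure 5 captions can be sketched as follows. This is a minimal illustration, assuming that the principal mask keeps a fixed fraction of the largest-magnitude entries of the rank-$k$ SVD reconstruction, and that "sub-random" means the overlap falls below the overall update density. The function names and the `frac` parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def rank_k_reconstruction(W, k):
    """Best rank-k approximation of W from its truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def top_fraction_mask(A, frac):
    """Boolean mask selecting the `frac` largest-magnitude entries of A."""
    n_keep = max(1, int(round(frac * A.size)))
    flat = np.abs(A).ravel()
    idx = np.argpartition(flat, -n_keep)[-n_keep:]
    mask = np.zeros(A.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(A.shape)

def principal_mask(W, k, frac):
    """Positions of the largest-magnitude entries of the rank-k reconstruction."""
    return top_fraction_mask(rank_k_reconstruction(W, k), frac)

def overlap_ratio(update_mask, ref_mask):
    """Fraction of positions in ref_mask that also appear in update_mask."""
    return (update_mask & ref_mask).sum() / ref_mask.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_before = rng.standard_normal((128, 96))
    # Synthetic update touching ~20% of entries at random (not real RLVR data).
    update_mask = rng.random(W_before.shape) < 0.2
    M_princ = principal_mask(W_before, k=8, frac=0.1)
    print("overlap with principal mask:", overlap_ratio(update_mask, M_princ))
    print("random baseline (update density):", update_mask.mean())
```

With real checkpoints, `update_mask` would come from comparing weights before and after training. An overlap below the update density would then match the sub-random pattern reported in Figure 5.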

Theorems & Definitions (38)

  • Definition 2.1: Unchanged Weight in bf16
  • Definition 2.2: bf16-aware Update Sparsity
  • Proposition 3.1: One‑step policy‑KL leash
  • Proposition 3.2: Policy‑KL leash $\Rightarrow$ weight bound
  • Theorem 3.3: Constrained subspace rotation with Wedin's sin-$\Theta$ theorem [wedin1972perturbation]
  • Corollary 3.4: Singular-value stability
  • Corollary 3.5: Top-$k$ energy and Ky Fan norms
  • Corollary 3.6: Magnitude-dependent realization threshold
  • Proposition D.1: Exact invariance
  • Lemma E.1: Gap between distinct bf16 representables
  • ...and 28 more
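
The bf16-related entries above (Definitions 2.1 and 2.2, Corollary 3.6, Lemma E.1) are listed here without their statements, so the sketch below is only an assumption-laden illustration of the underlying idea. It treats a weight as unchanged when its bf16 rounding is bitwise identical before and after the update, which may differ from the paper's exact definition. It relies on one known fact: bf16 stores 7 explicit mantissa bits, so the spacing between neighbouring values grows with magnitude. All function names are hypothetical, and inputs are assumed finite and within the normal range.

```python
import math
import struct

def to_bf16_bits(x):
    """Round a finite float to bf16 (round-to-nearest-even); return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # round the 16 dropped bits
    return (bits >> 16) & 0xFFFF

def from_bf16_bits(b):
    """Decode a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def bf16_spacing(x):
    """Gap between adjacent bf16 values near x (normal range, x != 0)."""
    _, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 8)  # 2**(floor(log2|x|) - 7)

def unchanged_in_bf16(w_before, w_after):
    """Assumed criterion: identical bf16 encodings before and after the update."""
    return to_bf16_bits(w_before) == to_bf16_bits(w_after)

def update_sparsity(before, after):
    """Fraction of weights whose bf16 encoding did not change."""
    same = sum(unchanged_in_bf16(a, b) for a, b in zip(before, after))
    return same / len(before)

if __name__ == "__main__":
    for w in (1.0, 64.0):
        print(w, "spacing:", bf16_spacing(w),
              "update of 0.01 visible:", not unchanged_in_bf16(w, w + 0.01))
```

In this toy setting the same absolute update of 0.01 is stored for a weight near 1.0 but rounded away for a weight near 64.0. That is one way to read a magnitude-dependent realization threshold: small updates to large-magnitude weights can be hidden by the storage format.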