STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

TL;DR

STAPO identifies a tiny fraction of spurious tokens (approximately 0.01%) that receive abnormally amplified policy gradients during RL fine-tuning of LLMs and destabilize training. The paper introduces the S2T (silencing spurious tokens) mechanism, which detects these tokens by their low probability, low entropy, and positive advantage and then suppresses their gradient contributions, and it incorporates this mechanism into a group-based objective, STAPO. Across six mathematical reasoning benchmarks and three Qwen base-model scales (1.7B, 8B, 14B), STAPO shows more stable policy entropy and average gains of 7.13% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.69% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL. The practical upshot is more reliable RL fine-tuning of LLMs for reasoning tasks through a targeted, minimal intervention that preserves legitimate exploration while suppressing rare, destructive signals.

Abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. Our analysis shows that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. To mitigate this instability, we design the S2T (silencing spurious tokens) mechanism, which efficiently identifies spurious tokens through three characteristic signals (low probability, low entropy, and positive advantage) and then suppresses their gradient perturbations during optimization. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.69% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
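The three detection signals named in the abstract (low probability, low entropy, positive advantage) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the threshold values `prob_threshold` and `entropy_threshold`, and the choice to zero out the per-token weight are all assumptions made for illustration, and the paper's actual thresholding strategy may differ.

```python
import numpy as np

def s2t_keep_mask(token_probs, token_entropies, advantages,
                  prob_threshold=0.01, entropy_threshold=0.5):
    """Return a boolean mask: True = keep the token, False = silence it.

    A token is flagged as spurious only when all three signals hold:
    low probability, low entropy, and positive advantage.
    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    token_probs = np.asarray(token_probs, dtype=float)
    token_entropies = np.asarray(token_entropies, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    spurious = (
        (token_probs < prob_threshold)
        & (token_entropies < entropy_threshold)
        & (advantages > 0)
    )
    return ~spurious

def masked_token_weights(advantages, keep_mask):
    """Zero the advantage of silenced tokens so they add no gradient."""
    advantages = np.asarray(advantages, dtype=float)
    return np.where(keep_mask, advantages, 0.0)

# Toy example with four tokens (all numbers are made up).
probs = np.array([0.90, 0.005, 0.004, 0.30])
entropies = np.array([0.20, 0.10, 2.00, 1.50])
advantages = np.array([1.0, 1.0, 1.0, -1.0])

keep = s2t_keep_mask(probs, entropies, advantages)
weights = masked_token_weights(advantages, keep)
print(keep)     # only the second token meets all three conditions
print(weights)
```

In this toy run only the second token is silenced. The third token is also unlikely, but its high entropy marks it as a genuinely uncertain position, so under these assumed thresholds it is kept as legitimate exploration.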

Paper Structure

This paper contains 27 sections, 4 theorems, 18 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

(Policy Gradient Norm Bounds). Consider the optimization objective at step $t$ for sample $i$ with target token $y_{i,t}$. The squared $\ell_2$-norm of the gradient $\nabla_{\bm{a}} \mathcal{J}$ w.r.t. the logits $\bm{a}$ is directly bounded by the entropy $\mathcal{H}(\pi_{\theta})$ and the target where $C_V=\frac{|\mathcal{V}| - 1}{|\mathcal{V}| (\ln |\mathcal{V}|)^2}$ and the weight $w_{i,t}$
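The theorem statement above is truncated in this summary, so the following is not the paper's bound. It is the standard softmax identity behind the link between gradient norm and token probability, under the assumption that the per-token objective has the form $\mathcal{J} = w_{i,t} \log \pi_{\theta}(y_{i,t})$ with $\pi_{\theta} = \mathrm{softmax}(\bm{a})$:

$$
\frac{\partial \log \pi_{\theta}(y_{i,t})}{\partial a_j} = \mathbb{1}[j = y_{i,t}] - \pi_{\theta}(j)
\quad\Longrightarrow\quad
\|\nabla_{\bm{a}} \mathcal{J}\|_2^2 = w_{i,t}^2 \Big( 1 - 2\,\pi_{\theta}(y_{i,t}) + \sum_{j \in \mathcal{V}} \pi_{\theta}(j)^2 \Big).
$$

Two limiting cases show the behavior. If $\pi_{\theta}(y_{i,t}) \to 1$, the bracket tends to $1 - 2 + 1 = 0$, so confident tokens receive almost no update. If $\pi_{\theta}(y_{i,t}) \to 0$ while nearly all mass sits on one other token (a low-entropy distribution), the bracket tends to $1 - 0 + 1 = 2$, its maximum. Under this assumed form, a low-probability token sampled at a low-entropy position therefore receives close to the largest possible gradient norm, scaled by $w_{i,t}^2$.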

Figures (6)

  • Figure 1: Core Idea. (a) Conceptual analogy: We argue that spurious tokens, which are rare and uninformative tokens within otherwise correct responses that receive disproportionately large gradient updates, can harm training stability, analogous to a dissonant vocalist disrupting the harmony of a performance. (b) By masking this negligible fraction (near $0.01\%$) of spurious tokens during the RL process of Qwen3-8B-Base, STAPO approaches the Pareto frontier of performance (AIME24 Acc) and entropy stability, compared to GRPO, 20-Entropy, and JustRL.
  • Figure 2: Comprehensive Analysis of Spurious Tokens.
  • Figure 3: Training Results across Different Models. Each row presents the training dynamics for a specific model size (1.7B, 8B, and 14B). Notably, STAPO achieves superior performance while maintaining stable policy entropy across all model sizes.
  • Figure 4: Sensitivity Analysis. Performance is reported as Acc (avg@32) on AIME24 and AIME25 benchmarks for Qwen3-1.7B base.
  • Figure 5: Ablation Results of Different Masking Strategies.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 3.1
  • proof
  • Lemma 3.2: Entropy Update Mechanism cui2025entropy
  • Lemma 3.3: Entropy-Conditioned Learning Potential wang2025beyond
  • Definition 3.4: Spurious Tokens
  • Remark 3.5: Thresholding Strategy
  • Lemma A.1: Gradient Norm Decomposition yang2025not
  • proof