STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu; Zeyu He; Guojian Zhan; Letian Tao; Zhilong Zheng; Jiang Wu; Yinuo Wang; Yang Guan; Kehua Sheng; Bo Zhang; Keqiang Li; Jingliang Duan; Shengbo Eben Li

TL;DR

The paper identifies a tiny set of spurious tokens that receive amplified policy gradients during RL fine-tuning of LLMs and destabilize training. It introduces S2T masking to silence these tokens and builds it into STAPO, a group-based objective that suppresses their gradient contributions, which improves entropy stability and reasoning accuracy. Across six mathematical benchmarks and three Qwen base-model scales (1.7B, 8B, 14B), STAPO achieves state-of-the-art performance and stronger training stability, with average gains of 7.13 percentage points in the training-aligned setting and 3.69 percentage points under the alternative evaluation setting, relative to GRPO, 20-Entropy, and JustRL. The practical result is more reliable RL fine-tuning of LLMs for reasoning tasks through a targeted, minimal intervention that preserves legitimate exploration while suppressing rare, destructive updates.

Abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. Our analysis shows that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01\%, which we term \emph{spurious tokens}. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. To mitigate this instability, we design the S2T (silencing spurious tokens) mechanism, which efficiently identifies spurious tokens through three characteristic signals (low probability, low entropy, and positive advantage) and then suppresses their gradient perturbations during optimization. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13\% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.69\% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
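The three signals named in the abstract (low probability, low entropy, positive advantage) can be combined into a per-token mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `s2t_mask` and `masked_mean`, both threshold values, and the choice to average the loss over unmasked tokens only are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def s2t_mask(
    probs: Sequence[float],
    entropies: Sequence[float],
    advantages: Sequence[float],
    prob_threshold: float = 0.01,      # hypothetical value, not from the paper
    entropy_threshold: float = 0.5,    # hypothetical value, not from the paper
) -> List[float]:
    """Return 0.0 for tokens flagged as spurious and 1.0 for all others.

    A token is flagged only when all three signals hold at once:
    low sampled-token probability, low local policy entropy, and
    positive advantage.
    """
    mask = []
    for p, h, a in zip(probs, entropies, advantages):
        is_spurious = p < prob_threshold and h < entropy_threshold and a > 0.0
        mask.append(0.0 if is_spurious else 1.0)
    return mask


def masked_mean(values: Sequence[float], mask: Sequence[float]) -> float:
    """Average values over unmasked tokens (assumed normalization)."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


# Four tokens from one correct response (positive advantage everywhere).
probs = [0.9, 0.005, 0.005, 0.3]
entropies = [0.2, 0.1, 2.0, 0.1]
advantages = [1.0, 1.0, 1.0, 1.0]
mask = s2t_mask(probs, entropies, advantages)
```

In this toy example only the second token is silenced: the third token is also rare but sits in a high-entropy position, so it is kept as legitimate exploration, and no token with a non-positive advantage is ever masked.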

Paper Structure (27 sections, 4 theorems, 18 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Preliminary
Problem Formulation
Group Relative Policy Optimization (GRPO)
Clip-Higher and Token Normalization
Methodology
Token-Level Optimization Mechanisms in RL Training
Spurious Token as Destructive Update Mode
Spurious-Token-Aware Policy Optimization (STAPO)
Related Work
Experiments
Settings
Main Results
Training Behaviors Analysis
Performance under Training-Aligned Settings
...and 12 more sections

Key Result

Theorem 3.1

(Policy Gradient Norm Bounds). Consider the optimization objective at step $t$ for sample $i$ with target token $y_{i,t}$. The squared $\ell_2$-norm of the gradient $\nabla_{\bm{a}} \mathcal{J}$ w.r.t. the logits $\bm{a}$ is directly bounded by the entropy $\mathcal{H}(\pi_{\theta})$ and the target-token probability, where $C_V=\frac{|\mathcal{V}| - 1}{|\mathcal{V}| (\ln |\mathcal{V}|)^2}$ and the weight $w_{i,t}$
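The dependence of the logit-gradient norm on the target-token probability can be illustrated with the standard softmax identity. Assuming a per-token objective of the form $w \log \pi(y)$ with $w$ treated as a constant (a simplification made for this sketch), the gradient w.r.t. the logits is $w(\bm{e}_y - \bm{\pi})$, so its squared norm is $w^2\,(1 - 2\pi_y + \sum_j \pi_j^2)$. The sketch below checks this identity numerically; it is not the theorem's bound, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def grad_sq_norm(logits: Sequence[float], target: int, weight: float = 1.0) -> float:
    """Squared l2-norm of d(weight * log pi[target]) / d(logits), computed directly."""
    probs = softmax(logits)
    grad = [weight * ((1.0 if j == target else 0.0) - p) for j, p in enumerate(probs)]
    return sum(g * g for g in grad)


def grad_sq_norm_closed_form(logits: Sequence[float], target: int, weight: float = 1.0) -> float:
    """Same quantity via weight^2 * (1 - 2 * pi_y + sum_j pi_j^2)."""
    probs = softmax(logits)
    return weight ** 2 * (1.0 - 2.0 * probs[target] + sum(p * p for p in probs))


# A low-entropy distribution: token 0 dominates, tokens 1..3 are rare.
logits = [6.0, 0.0, 0.0, 0.0]
dominant = grad_sq_norm(logits, target=0)
rare = grad_sq_norm(logits, target=1)
```

Here `rare` is several orders of magnitude larger than `dominant` and approaches the maximum value $2w^2$ of the expression, which is the regime (low probability at a low-entropy position) that the paper associates with spurious tokens.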

Figures (6)

Figure 1: Core Idea. (a) Conceptual analogy: We argue that spurious tokens, which are rare and uninformative tokens within otherwise correct responses that receive disproportionately large gradient updates, can harm training stability, analogous to a dissonant vocalist disrupting the harmony of a performance. (b) By masking this negligible fraction (near $0.01\%$) of spurious tokens during the RL process of Qwen3-8B-Base, STAPO approaches the Pareto frontier of performance (AIME24 Acc) and entropy stability, compared to GRPO, 20-Entropy, and JustRL.
Figure 2: Comprehensive Analysis of Spurious Tokens.
Figure 3: Training Results across Different Models. Each row presents the training dynamics for a specific model size (1.7B, 8B, and 14B). Notably, STAPO achieves superior performance while maintaining stable policy entropy across all model sizes.
Figure 4: Sensitivity Analysis. Performance is reported as Acc (avg@32) on AIME24 and AIME25 benchmarks for Qwen3-1.7B base.
Figure 5: Ablation Results of Different Masking Strategies.
...and 1 more figure

Theorems & Definitions (8)

Theorem 3.1
proof
Lemma 3.2: Entropy Update Mechanism cui2025entropy
Lemma 3.3: Entropy-Conditioned Learning Potential wang2025beyond
Definition 3.4: Spurious Tokens
Remark 3.5: Thresholding Strategy
Lemma A.1: Gradient Norm Decomposition yang2025not
proof
