Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning

Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

Abstract

Large language models (LLMs) show strong reasoning abilities but often produce unnecessarily long explanations that reduce efficiency. Although reinforcement learning (RL) has been used to improve reasoning, most methods focus on accuracy and rely on uniform length-based rewards that overlook the differing contributions of individual tokens, often harming correctness. We revisit length optimization in RL through the perspective of token significance. Observing that many chain-of-thought (CoT) tokens contribute little to the final answer, we introduce a significance-aware length reward that selectively penalizes insignificant tokens, reducing redundancy while preserving essential reasoning. We also propose a dynamic length reward that encourages more detailed reasoning early in training and gradually shifts toward conciseness as learning progresses. Integrating these components into standard policy optimization yields a framework that improves both reasoning efficiency and accuracy. Experiments across multiple benchmarks demonstrate substantial reductions in response length while preserving or improving correctness, highlighting the importance of modeling token significance for efficient LLM reasoning.

Paper Structure

This paper contains 39 sections, 2 theorems, 45 equations, 13 figures, 8 tables.

Key Result

Lemma 1

Let $\hat{Z}_{\text{full}}$ denote the answer decoded from the full CoT, and $\hat{Z}_{\text{sig}}$ the answer decoded after removing $\mathcal{Y}_{\mathrm{insig}}$. Under Assumption as:mi-proxy, the increase in error probability is bounded.

Figures (13)

  • Figure 1: (A) Limitations of Uniform Length Penalization. LLMs often produce verbose and redundant reasoning, and applying a uniform length penalty, as done in many prior RL approaches, fails to account for the differing importance of individual tokens, which can lead to accuracy degradation; (B) Reinforcement Learning with Token Significance and Dynamic Length Control. Our method models token significance to selectively penalize unimportant tokens and introduces dynamic length control to balance exploration and conciseness throughout training, enabling LLMs to generate reasoning that is both accurate and efficient.
  • Figure 2: Illustration of the Bingo framework. Given a generated CoT trace, the LLM first distinguishes between significant and insignificant tokens. A dynamic length reward is then computed based on token type and sample correctness. During the early exploration phase of training ($k(t) \geq \beta$), the reward encourages extended reasoning for significant tokens in incorrect samples while penalizing insignificant tokens in all cases. As training progresses ($k(t) < \beta$), the reward shifts toward promoting conciseness by discouraging both significant and insignificant length where appropriate. This two-stage strategy allows the model to first explore broadly and then compress effectively. The aggregated rewards are then used to update the policy via RL, resulting in more accurate and efficient reasoning.
  • Figure 3: Significant Length Ratio dynamics during training. The x-axis indicates training steps, and the y-axis denotes the proportion of significant tokens in the generated responses. Each subplot corresponds to one benchmark evaluated using DeepSeek-R1-Distill-Qwen-1.5B as the base model. The blue curve represents the baseline method (Vanilla PPO), and the red curve represents our approach (Ours).
  • Figure 4: Penalty curve: $\sqrt{1 - \frac{L}{L_{\max}}}$.
  • Figure 5: Performance overview of Bingo and other baselines. Left: Scatter plot of average accuracy versus average response length on four benchmarks (MATH500, GSM8K, TheoremQA, AIME2024) using DeepSeek-R1-Distill-Qwen-1.5B as the base model. Points nearer the top-right corner represent a better balance of accuracy and efficiency. Right: Radar chart of length-normalized accuracy for each method. Greater radial distances denote higher efficiency.
  • ...and 8 more figures
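
The penalty curve named in the Figure 4 caption, $\sqrt{1 - L/L_{\max}}$, can be evaluated numerically to see its shape. The snippet below is a minimal sketch, not the paper's implementation: the function name `penalty_curve`, the clamping of lengths outside $[0, L_{\max}]$, and the sample values are illustrative assumptions, and the captions do not say how this curve enters the full reward.

```python
import math


def penalty_curve(length: float, max_length: float) -> float:
    """Evaluate sqrt(1 - L / L_max), the curve given in the Figure 4 caption.

    `penalty_curve` is a hypothetical helper name, not one taken from the paper.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Assumption: clamp the ratio to [0, 1] so that lengths beyond L_max
    # do not yield a negative value under the square root.
    ratio = min(max(length / max_length, 0.0), 1.0)
    return math.sqrt(1.0 - ratio)


if __name__ == "__main__":
    # Illustrative lengths only; L_max = 1024 is an arbitrary choice.
    for length in (0, 256, 512, 768, 1024):
        print(length, round(penalty_curve(length, 1024), 3))
```

The curve equals 1 at $L = 0$, decreases monotonically, and reaches 0 at $L = L_{\max}$; because of the square root, it falls slowly for short responses and steeply as $L$ approaches $L_{\max}$.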

Theorems & Definitions (5)

  • Lemma 1: Bounded Accuracy Loss
  • proof
  • Definition 1: General vs. Significance-Aware Length Reward
  • Theorem 1: Benefit of the Significance-Aware Reward
  • proof