
Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang

TL;DR

Not All Tokens Are Needed (NAT) is a unified framework that makes the token budget a first-class optimization primitive and provides an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories.

Abstract

Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC), both of which reduce forward and backward compute and memory without modifying the reward computation or rollout pipeline. Across mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of tokens, providing an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories. In our experiments, RPC saves 18% peak GPU memory and 29% forward and backward RL training time for Qwen3-8B.
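
The reweighting idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the per-trajectory objective is a mean of per-token losses, that URS keeps each token independently with probability p, and that RPC keeps a prefix whose length is drawn uniformly from 1..T. The function names `urs_mask`, `rpc_mask`, and `ht_estimate` are hypothetical, and the sketch works on precomputed per-token losses, so it shows only the statistical correction, not the compute savings.

```python
import numpy as np


def urs_mask(T, p, rng):
    """Uniform Random Sampling (assumed form): keep each token independently with probability p."""
    mask = rng.random(T) < p
    probs = np.full(T, p)  # inclusion probability is p for every token
    return mask, probs


def rpc_mask(T, rng):
    """Random Prefix Cutting (assumed form): keep the first L tokens, with L ~ Uniform{1, ..., T}."""
    L = int(rng.integers(1, T + 1))
    mask = np.arange(T) < L
    # The token at 0-based position t is kept iff L >= t + 1, so P(keep) = (T - t) / T > 0.
    probs = (T - np.arange(T)) / T
    return mask, probs


def ht_estimate(token_losses, mask, probs):
    """Horvitz-Thompson estimate of the mean per-token loss from the selected tokens only."""
    T = len(token_losses)
    # Each kept token is upweighted by 1 / (its inclusion probability).
    return float(np.sum(token_losses[mask] / probs[mask]) / T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = rng.normal(size=16)  # stand-in for per-token loss terms
    for name, sampler in [("URS", lambda: urs_mask(16, 0.5, rng)),
                          ("RPC", lambda: rpc_mask(16, rng))]:
        estimates = [ht_estimate(losses, *sampler()) for _ in range(20000)]
        print(name, "mean of HT estimates:", np.mean(estimates),
              "full-token mean:", losses.mean())
```

Under these assumptions, the average of the reweighted estimates approaches the full-token mean for both selection schemes, even though any single draw uses only a subset of the tokens. Later prefix positions in the RPC variant carry small inclusion probabilities and therefore large weights, which is one reason the variance of such an estimator deserves attention in practice.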

Paper Structure

This paper contains 59 sections, 1 theorem, 26 equations, 6 figures, and 3 tables.

Key Result

Proposition 1

For any inclusion probabilities $\{p_{i,t}\}_{t=1}^{T_i}$ with $p_{i,t}>0$, $\mathbb{E}_m[\widehat{\mu}_i^{\text{HT}}(\theta)] = \mu_i(\theta)$. Moreover, under standard regularity conditions allowing interchange of gradient and expectation, $\mathbb{E}_m[\nabla_\theta \widehat{\mu}_i^{\text{HT}}(\theta)] = \nabla_\theta \mu_i(\theta)$.
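
The unbiasedness claim can be checked in a few lines. The derivation below is a sketch under assumptions not stated on this page: that $\mu_i(\theta)$ is the mean of per-token terms $\ell_{i,t}(\theta)$, that the estimator has the standard Horvitz-Thompson form with a binary mask $m_{i,t}$ satisfying $\mathbb{E}_m[m_{i,t}] = p_{i,t}$, and that the mask distribution does not depend on $\theta$. The symbols $\ell_{i,t}$ and $m_{i,t}$ are introduced here for illustration.

$$
\mu_i(\theta) = \frac{1}{T_i}\sum_{t=1}^{T_i} \ell_{i,t}(\theta),
\qquad
\widehat{\mu}_i^{\text{HT}}(\theta) = \frac{1}{T_i}\sum_{t=1}^{T_i} \frac{m_{i,t}}{p_{i,t}}\,\ell_{i,t}(\theta)
$$

$$
\mathbb{E}_m\big[\widehat{\mu}_i^{\text{HT}}(\theta)\big]
= \frac{1}{T_i}\sum_{t=1}^{T_i} \frac{\mathbb{E}_m[m_{i,t}]}{p_{i,t}}\,\ell_{i,t}(\theta)
= \frac{1}{T_i}\sum_{t=1}^{T_i} \ell_{i,t}(\theta)
= \mu_i(\theta)
$$

$$
\mathbb{E}_m\big[\nabla_\theta \widehat{\mu}_i^{\text{HT}}(\theta)\big]
= \frac{1}{T_i}\sum_{t=1}^{T_i} \frac{\mathbb{E}_m[m_{i,t}]}{p_{i,t}}\,\nabla_\theta \ell_{i,t}(\theta)
= \nabla_\theta \mu_i(\theta)
$$

Only the marginal inclusion probabilities enter the argument, so the same steps apply whether tokens are selected independently or through a correlated scheme such as a random prefix, as long as every $p_{i,t}$ is strictly positive.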

Figures (6)

  • Figure 1: Bar plots of Qwen3-8B RL training metrics with 95% confidence intervals across 5 runs for GRPO (vanilla GRPO), URS (GRPO with random sampling $p=0.5$), Det. Trunc. (GRPO with deterministic prefix truncation of 50% of trajectory tokens) and RPC (GRPO with uniform random prefix cutting).
  • Figure 2: Entropy curves with 95% confidence interval across 5 runs for GRPO (vanilla GRPO), URS (GRPO with random sampling $p=0.5$), Det. Trunc. (GRPO with deterministic prefix truncation of 50% of trajectory tokens) and RPC (GRPO with uniform random prefix cutting).
  • Figure 3: Percentage of selected tokens with 95% confidence interval across 5 runs for RPC (GRPO with uniform random prefix cutting).
  • Figure 4: Gradient norm curves with 95% confidence interval across 5 runs for GRPO (vanilla GRPO), URS (GRPO with random sampling $p=0.5$), Det. Trunc. (GRPO with deterministic prefix truncation of 50% of trajectory tokens) and RPC (GRPO with uniform random prefix cutting).
  • Figure 5: Time per step (excluding inference time) with 95% confidence interval across 5 runs for RPC (GRPO with uniform random prefix cutting).
  • ...and 1 more figure

Theorems & Definitions (2)

  • Proposition 1: Unbiasedness of HT token masking
  • Proof: Proof of Gradient Unbiasedness