Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang

TL;DR

RL post-training rollouts are the dominant bottleneck due to long-tail trajectory lengths. DAS couples a nonparametric, history-driven suffix-tree drafter with a length-aware speculative-budgeting policy to accelerate rollouts without changing the learned policy or rewards. Key innovations include per-problem suffix-tree drafters, sliding-window history, and runtime length prediction that adapt budgets to problem difficulty, yielding up to 50% rollout-time reductions on math and code RL tasks while preserving training curves. This distribution-aware speculative decoding approach offers a practical path to substantially faster RL post-training for large language models.

Abstract

Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck, the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
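
The history-driven drafting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the incrementally maintained suffix tree for a plain n-gram index that is rebuilt on every call, and the names `HistoryDrafter`, `verify_greedy`, `max_ngram`, `window`, and `target_next` are hypothetical. Verification here assumes greedy decoding and calls the target once per token for clarity, whereas a real system would verify the whole draft in one parallel forward pass.

```python
from collections import defaultdict, deque


class HistoryDrafter:
    """Toy nonparametric drafter over a sliding window of recent rollouts.

    Assumption: an n-gram index stands in for the suffix tree.
    """

    def __init__(self, max_ngram=4, window=8):
        self.max_ngram = max_ngram
        # Only the most recent `window` rollouts are kept, so stale
        # generations from older policies drop out automatically.
        self.rollouts = deque(maxlen=window)

    def add_rollout(self, tokens):
        self.rollouts.append(list(tokens))

    def _index(self):
        # Map each n-gram to the (rollout id, end position) of its occurrences.
        # Rebuilt from scratch for simplicity; a suffix tree avoids this cost.
        index = defaultdict(list)
        for r_id, tokens in enumerate(self.rollouts):
            for n in range(1, self.max_ngram + 1):
                for i in range(len(tokens) - n + 1):
                    index[tuple(tokens[i:i + n])].append((r_id, i + n))
        return index

    def propose(self, context, budget):
        """Return up to `budget` draft tokens following the longest matching suffix."""
        if budget <= 0:
            return []
        index = self._index()
        for n in range(min(self.max_ngram, len(context)), 0, -1):
            hits = index.get(tuple(context[-n:]))
            if hits:
                r_id, pos = hits[-1]  # prefer the most recent occurrence
                return self.rollouts[r_id][pos:pos + budget]
        return []


def verify_greedy(target_next, context, draft):
    """Accept the longest draft prefix that matches the target's greedy choices.

    `target_next(tokens)` is a hypothetical callable returning the target
    model's next token. The output equals what the target alone would produce.
    """
    accepted = []
    for tok in draft:
        if target_next(context + accepted) != tok:
            break
        accepted.append(tok)
    # The target always contributes one token after the accepted prefix.
    accepted.append(target_next(context + accepted))
    return accepted
```

Because every accepted token is checked against the target's own choice, the generated sequence is unchanged; the drafter only affects how many target calls are needed per generated token.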

Paper Structure

This paper contains 29 sections, 15 equations, 13 figures.

Figures (13)

  • Figure 1: Effective batch size collapse during rollout without DAS. Measured on the DeepSeek-distilled 7B model [guo2025deepseek] with DeepScaleR [deepscaler2025] prompts. As decoding progresses, short sequences finish first and the effective batch size shrinks, leaving a few long stragglers to determine the step makespan. Our method both reduces total latency and alleviates the impact of long-tail stragglers.
  • Figure 2: (Left) Content similarity per iteration, measured as an n-gram reuse ratio. (Right) Pairwise similarity across epochs for Qwen2.5-7B-Instruct. The block structure concentrated near the diagonal shows that rollouts are most similar to those from recent epochs, and similarity decays with temporal distance. This reflects policy drift: as the policy is continually updated, older generations become less predictive of current behavior.
  • Figure 3: Overview of our rollout acceleration framework in RL training. (Left) Length-aware draft budget allocation. We estimate per-problem length from recent rollouts and assign a draft budget accordingly: problems predicted to be long or hard are allocated a more aggressive speculative budget, while easy problems receive little or no speculation. This policy is updated over a sliding window of recent trajectories, so it adapts as the policy changes over training. (Right) Distribution-aware, self-evolving speculative decoding. For each problem shard, we maintain a suffix-tree speculator that is incrementally updated from the most recent rollouts. At decode time, the speculator proposes multi-token drafts drawn from high-frequency suffix matches, and the target model verifies them in parallel; accepted tokens advance generation with reduced rollout latency.
  • Figure 4: Average accepted tokens per verification round in RL training. We compare a static learned drafter, EAGLE [li2024eagle2], with our training-free drafter. While the static drafter's acceptance stays flat, our nonparametric drafter updates with recent rollouts and tracks the evolving policy, yielding higher accepted length over time. Higher accepted length implies fewer target forward passes per generated token and thus lower rollout latency.
  • Figure 5: Performance comparison of suffix tree and suffix array data structures. Left: Speculation time across different corpus sizes. Right: Update time for inserting 100 tokens (log scale).
  • ...and 8 more figures
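
The length-aware budget allocation described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions: the class name `LengthAwareBudget`, the thresholds `short_len` and `long_len`, the linear interpolation between them, the mean as the length predictor, and the default for unseen problems are all hypothetical choices, not values or rules taken from the paper.

```python
from collections import defaultdict, deque
from statistics import mean


class LengthAwareBudget:
    """Assign a per-problem draft budget from a sliding window of rollout lengths."""

    def __init__(self, window=4, short_len=512, long_len=4096, max_budget=8):
        self.short_len = short_len    # at or below this: no speculation (assumed threshold)
        self.long_len = long_len      # at or above this: full budget (assumed threshold)
        self.max_budget = max_budget  # largest number of draft tokens per round
        # One bounded deque per problem, so old lengths age out as the policy drifts.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, problem_id, length):
        """Store the observed rollout length for a problem."""
        self.history[problem_id].append(length)

    def predicted_length(self, problem_id):
        """Predict length as the mean over the window, or None without history."""
        lengths = self.history.get(problem_id)
        return mean(lengths) if lengths else None

    def budget(self, problem_id):
        """Return the number of draft tokens to speculate for this problem."""
        pred = self.predicted_length(problem_id)
        if pred is None:
            # No history yet: fall back to a moderate budget (assumption).
            return self.max_budget // 2
        if pred <= self.short_len:
            return 0
        if pred >= self.long_len:
            return self.max_budget
        # Scale linearly between the two thresholds, with at least one draft token.
        frac = (pred - self.short_len) / (self.long_len - self.short_len)
        return max(1, round(frac * self.max_budget))
```

For example, after calling `record("p1", 6000)` on a default-configured instance, `budget("p1")` returns the full `max_budget`, while a problem whose recent rollouts average under `short_len` tokens gets a budget of 0 and decodes without speculation.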