Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

TL;DR

TLT tackles the persistent long-tail rollout bottleneck in reasoning RL training by weaving together an adaptive, lightweight drafter and a memory-efficient, adaptive rollout engine that selects speculative decoding strategies in real time. The approach preserves the exact training dynamics (lossless) while exploiting idle GPU bubbles to train the drafter and reuse pre-captured CUDA graphs, achieving up to $2.1\times$ end-to-end speedups with minimal overhead. Key innovations include a single-layer draft model aligned to the target, opportunistic Spot Training, Bucketed CUDA Graph captures, and a BEG MAB-based SD auto-tuner, plus a model-free drafter as a robust fallback. The results show strong throughput gains across multiple model sizes and hardware, with preserved reward trajectories and a high-quality draft model suitable for deployment, illustrating practical impact for scalable reasoning RL.
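
To make the idea of a bandit-based SD auto-tuner concrete, the following is a minimal sketch of a plain epsilon-greedy multi-armed bandit that picks a speculative decoding strategy per batch and learns from observed throughput. It is not the paper's BEG MAB algorithm: the class name, the strategy names (`no_sd`, `draft_len_4`, `draft_len_8`), and the throughput numbers are all hypothetical and chosen only for illustration.

```python
import random


class EpsilonGreedyTuner:
    """Toy epsilon-greedy bandit over hypothetical SD strategies (illustrative only)."""

    def __init__(self, strategies, epsilon=0.1, seed=0):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {s: 0 for s in self.strategies}
        self.mean_reward = {s: 0.0 for s in self.strategies}

    def select(self):
        # Try every strategy once before trusting any estimate.
        for s in self.strategies:
            if self.counts[s] == 0:
                return s
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.mean_reward[s])

    def update(self, strategy, reward):
        # Incremental running mean of the observed reward for this strategy.
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.mean_reward[strategy] += (reward - self.mean_reward[strategy]) / n


def simulate(steps=500):
    # Made-up mean throughputs (tokens/s); a real tuner would measure these.
    true_throughput = {"no_sd": 100.0, "draft_len_4": 150.0, "draft_len_8": 120.0}
    tuner = EpsilonGreedyTuner(true_throughput, epsilon=0.1, seed=0)
    noise = random.Random(1)
    for _ in range(steps):
        strategy = tuner.select()
        observed = noise.gauss(true_throughput[strategy], 5.0)
        tuner.update(strategy, observed)
    return tuner


if __name__ == "__main__":
    tuner = simulate()
    print(tuner.counts)
```

In this toy setting the tuner concentrates most of its selections on the strategy with the highest simulated throughput while still occasionally probing the others, which is what lets it react when the workload shifts.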

Abstract

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to dynamic workloads, an evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
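
The "lossless" property of speculative decoding can be illustrated with a small sketch, under simplifying assumptions: decoding is greedy, and both models are stand-in deterministic functions (`toy_target` and `toy_draft` are hypothetical, not real models). The draft proposes a few tokens, the target checks them, and the longest agreeing prefix is kept plus one token from the target. A real system would verify all proposed positions in one batched forward pass and would use rejection sampling for sampled decoding; neither is modeled here.

```python
def toy_target(ctx):
    # Hypothetical deterministic "target model": next token from the context.
    return (31 * sum(ctx) + 7 * len(ctx)) % 50


def toy_draft(ctx):
    # Hypothetical "draft model": agrees with the target except at some positions.
    t = toy_target(ctx)
    return (t + 1) % 50 if len(ctx) % 5 == 0 else t


def vanilla_greedy_decode(target_next, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    tokens = list(prompt)
    verify_rounds = 0  # each round stands in for one target verification pass
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft proposes draft_len tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) The target verifies the proposal position by position.
        verify_rounds += 1
        accepted, ctx = [], list(tokens)
        for t in proposal:
            expected = target_next(ctx)
            if expected != t:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # all accepted: one bonus target token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = vanilla_greedy_decode(toy_target, prompt, 20)
    output, rounds = speculative_greedy_decode(toy_target, toy_draft, prompt, 20)
    print(output == reference, rounds)
```

Every emitted token is the one the target would have chosen for the same prefix, so the output matches vanilla greedy decoding exactly, while the number of verification rounds is smaller than the number of generated tokens whenever the draft agrees often enough.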

Paper Structure

This paper contains 20 sections, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Observed issues of long-tail generation (rollout) and workload imbalance in reasoning RL. The TLT system addresses these challenges with an adaptive drafter.
  • Figure 2: RL training trace from ByteDance [dapoTrace], based on the Qwen2.5-32B model [qwen25] and executed on H20 96GB GPUs. "p75" denotes the 75th percentile; more granular percentile data were not provided in the original source.
  • Figure 3: Test-time scaling of reasoning models. (a) Performance of OpenAI-o1 [O1] and Stanford s1-32B [s1] on the competition-level math benchmark AIME [AIME]. (b) Example of self-reflection correcting an error within the reasoning.
  • Figure 4: Overview of the GRPO [DeepSeekMath] RL training process.
  • Figure 5: Overview of Speculative Decoding. (a & b) Comparison between Vanilla and Speculative Decoding. (c) Speculative decoding achieves peak compute throughput (TFLOPS) at significantly smaller batch sizes (gray arrow).
  • ...and 12 more figures