SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, Jinsong Su

TL;DR

This work targets the rollout bottleneck in reinforcement learning with verifiable rewards (RLVR) for large reasoning models. It introduces SPEC-RL, a framework that keeps a lightweight cache of rollouts from the previous epoch, treats them as implicit drafts, and verifies their prefixes under the current policy; a lenience parameter controls the trade-off between efficiency and fidelity. SPEC-RL achieves 2-3x rollout speedups across diverse math-reasoning benchmarks and model scales while preserving or improving policy performance, and it integrates seamlessly with PPO, GRPO, and DAPO. Because it changes only the rollout stage, it offers a general, plug-in path to scaling RLVR for large reasoning models with minimal risk of bias or reward misalignment. The code is open source.

Abstract

Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL
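
The draft-and-verify step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a speculative-decoding-style acceptance rule in which each cached token is kept with probability min(1, lenience * p_current / p_previous), and the function name `verified_prefix_length` and its parameters are hypothetical. The inputs are per-token log-probabilities of the cached response under the previous and current policies.

```python
import math
import random
from typing import Optional, Sequence


def verified_prefix_length(
    old_logprobs: Sequence[float],
    cur_logprobs: Sequence[float],
    lenience: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Return how many leading cached tokens survive verification.

    Assumed rule (illustrative, not taken from the paper's text):
    accept token i with probability min(1, lenience * p_cur / p_old),
    and stop at the first rejected token.
    """
    if len(old_logprobs) != len(cur_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if lenience <= 0:
        raise ValueError("lenience must be positive")
    rng = rng or random.Random()

    for i, (lp_old, lp_cur) in enumerate(zip(old_logprobs, cur_logprobs)):
        # Probability ratio computed in log space for numerical stability.
        accept_prob = min(1.0, lenience * math.exp(lp_cur - lp_old))
        if rng.random() >= accept_prob:
            return i  # first rejection: keep tokens [0, i)
    return len(old_logprobs)  # the whole cached response was accepted


# Identical policies: every ratio is 1, so the full draft is kept.
same = [math.log(0.5)] * 4
full = verified_prefix_length(same, same, rng=random.Random(0))

# The current policy assigns zero probability to the second token,
# so only the first token is kept and generation would resume there.
drifted = [math.log(0.5), float("-inf"), math.log(0.5), math.log(0.5)]
partial = verified_prefix_length(same, drifted, rng=random.Random(0))
```

Under this assumed rule, a larger lenience accepts longer prefixes (more reuse, less fidelity to the current policy), and the tokens after the verified prefix would be generated by the current policy as usual.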

Paper Structure

This paper contains 37 sections, 3 equations, 15 figures, 27 tables, 1 algorithm.

Figures (15)

  • Figure 1: SPEC-RL achieves a 2–3$\times$ speedup in per-step rollout time without compromising average performance on Qwen3-8B-Base across various algorithms.
  • Figure 2: Token overlap ratios across different RL algorithms, computed with ROUGE-1 (Lin, 2004) by comparing rollout responses from consecutive training epochs.
  • Figure 3: A comparison of the rollout processes in vanilla RLVR and our method. At each training step, vanilla RLVR regenerates full responses. In contrast, SPEC-RL retrieves cached rollouts from the previous epoch, verifies them in parallel, keeps the verified prefixes, and resumes generation with the current policy to produce the final response.
  • Figure 4: The effect of lenience $\ell$ on model performance and efficiency: (a) averaged test performance, (b) rollout time, and (c) averaged verified prefix length at different training steps.
  • Figure 5: Training dynamics of SPEC-RL under different $\ell$ for three metrics: (a) entropy, (b) KL divergence, and (c) policy gradient clip ratio.
  • ...and 10 more figures
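
The token-overlap measurement mentioned in the Figure 2 caption can be approximated with a short sketch. It assumes whitespace tokenization and reports ROUGE-1 recall, precision, and F1, since the caption does not say which variant or tokenizer is used; the function name `rouge1` and the example strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def rouge1(prev_tokens: Sequence[str], cur_tokens: Sequence[str]) -> Dict[str, float]:
    """Unigram overlap between a previous-epoch and a current-epoch response."""
    # Clipped unigram matches: each token counts at most as often
    # as it appears in both responses.
    overlap = sum((Counter(prev_tokens) & Counter(cur_tokens)).values())
    recall = overlap / len(prev_tokens) if prev_tokens else 0.0
    precision = overlap / len(cur_tokens) if cur_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Two responses to the same prompt that differ in a single token.
prev_response = "let x = 2 so x + 1 = 3".split()
cur_response = "let x = 2 then x + 1 = 3".split()
scores = rouge1(prev_response, cur_response)
```

Here nine of the ten unigrams match, so all three scores come out at about 0.9. Note that ROUGE-1 ignores token order, so it is an upper bound on how much of a cached response could be reused as a contiguous prefix.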