ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang

TL;DR

ReSpec tackles the generation bottleneck in RL-based LLM adaptation by integrating adaptive speculative decoding with a tight feedback loop from on-policy signals. It introduces an Adaptive SD Server and an Online Learner that uses Reward-Weighted Knowledge Distillation and asynchronous updates to keep the lightweight drafter aligned with an evolving actor, while dynamically tuning SD configurations. The approach mitigates three critical issues—diminishing speedups, drafter staleness, and policy degradation—through adaptive scheduling, continual drafter evolution, and reward-aware updates. Empirically, ReSpec achieves up to 4.5x end-to-end speedup on Qwen 3B–14B with stable reward convergence, making efficient RL-based LLM adaptation practical at scale.

Abstract

Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
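
To make "evolving the drafter via knowledge distillation" and "weighting updates by rollout rewards" more concrete, here is a minimal sketch of a reward-weighted distillation loss in plain Python. It is not the paper's formulation, which this excerpt does not give. The function names, the per-token KL(actor || drafter) objective, and the min-shift normalization of rewards are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def reward_weights(rewards):
    """Hypothetical weighting: shift rewards so the minimum is 0, then normalize.

    Falls back to uniform weights when all rewards are equal.
    """
    lowest = min(rewards)
    shifted = [r - lowest for r in rewards]
    total = sum(shifted)
    if total == 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [s / total for s in shifted]


def reward_weighted_kd_loss(actor_probs, drafter_probs, rewards):
    """Reward-weighted mean of per-token KL between actor and drafter.

    actor_probs[i][t] and drafter_probs[i][t] are next-token distributions
    for rollout i at position t; rewards[i] is the scalar reward of rollout i.
    """
    weights = reward_weights(rewards)
    loss = 0.0
    for w, actor_seq, drafter_seq in zip(weights, actor_probs, drafter_probs):
        per_token = [kl_divergence(a, d) for a, d in zip(actor_seq, drafter_seq)]
        loss += w * sum(per_token) / len(per_token)
    return loss
```

Under this assumed weighting, a drafter mismatch on a low-reward rollout contributes little or nothing to the loss, while a mismatch on a high-reward rollout dominates it. A real system would compute the same quantity on logits with an autodiff framework and would probably use a smoother weighting.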

Paper Structure

This paper contains 21 sections, 5 equations, 15 figures, 2 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of SD in RL training and our proposed ReSpec system. (a) The RL training workflow integrated with SD, highlighting three fundamental gaps. (b) The design of ReSpec, which addresses these gaps through three complementary mechanisms: dynamically tuning SD configurations to match workload conditions, continuously aligning the drafter with the evolving actor using on-policy signals, and weighting drafter updates by rollout quality to maintain stable policy learning. Together, these mechanisms enable stable and efficient SD throughout RL training.
  • Figure 2: EAGLE-3 workflow. The target model generates one token, while the draft model produces multiple candidates using hidden states. The target verifies all candidates in a single forward pass and accepts four tokens (“the key to success”) with only two forward passes.
  • Figure 3: Speedup ratio under different batch sizes and temperatures for Qwen2.5-7B-instruct on H100, using the MTBench dataset.
  • Figure 4: Acceptance length of the draft model over 100 RL steps with the Qwen2.5-7B-instruct model on the math dataset. As training progresses, the EAGLE-3 drafter quickly becomes stale and its acceptance length drops.
  • Figure 5: Evolution of rewards over 200 RL steps for Qwen2.5-7B on the math dataset. Naïve application of EAGLE-3 leads to a measurable drop in reward, illustrating drafter-induced distributional bias during RL training.
  • ...and 10 more figures
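
The draft-then-verify pattern described in the Figure 2 caption can be illustrated with a toy sketch. It assumes greedy, exact-match verification and uses hypothetical callables `target_next` and `draft_next` that map a token list to the next token. It does not model EAGLE-3's use of hidden states or any tree-structured drafting, and the verification loop stands in for what would be a single batched target forward pass in a real system.

```python
def speculative_step(target_next, draft_next, prefix, num_draft):
    """Run one draft-then-verify round and return the tokens it emits."""
    # Draft phase: the lightweight drafter proposes num_draft tokens in sequence.
    draft = []
    context = list(prefix)
    for _ in range(num_draft):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # Verify phase: accept drafted tokens until the first disagreement.
    accepted = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # On a mismatch, emit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


# Toy models (illustrative only): the target "knows" a fixed sentence,
# and the drafter agrees with it for the first three positions.
SENTENCE = ["the", "key", "to", "success", "is", "practice"]


def target_next(context):
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft_next(context):
    return SENTENCE[len(context)] if len(context) < 3 else "luck"


emitted = speculative_step(target_next, draft_next, [], num_draft=4)
print(emitted)  # ['the', 'key', 'to', 'success']
```

In this toy run the drafter's first three tokens are accepted and the fourth is replaced by the target's token, so one verification round emits four tokens. The number of tokens accepted per round is what the figure captions above call acceptance length, and it shrinks when the drafter drifts away from the target.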