When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

TL;DR

Aurora introduces a unified training-serving system that treats online speculator learning as an asynchronous reinforcement-learning problem, enabling day-0 deployment and continuous improvement from live inference traces. By coupling an SGLang-based inference server with an asynchronous training server and using lazy synchronization, Aurora closes the training-serving loop, mitigating deployment lag and domain drift in speculative decoding. Key contributions include a practical architecture, an online RL objective with acceptance and discard signals, and a demonstration that simple online finetuning yields substantial performance gains across frontier models, with additional adaptation benefits under distribution shifts. The approach reduces infrastructure overhead by avoiding large offline activation distillation, enables rapid adaptation to traffic, and demonstrates scalable speedups on large models, making speculative decoding more practical for production deployments.

Abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
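As a rough illustration of the accepted/rejected feedback described in the abstract, the sketch below splits one speculative step into a positive and a negative signal. It assumes greedy verification (a draft token is accepted only if it equals the verifier's token at that position); the names `TraceSignals` and `split_trace` are hypothetical and are not taken from Aurora or SGLang, and the paper's actual RL objective is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceSignals:
    accepted: List[int]  # draft tokens the verifier agreed with (positive feedback)
    rejected: List[int]  # draft tokens discarded from the first mismatch on (negative feedback)


def split_trace(draft: List[int], target: List[int]) -> TraceSignals:
    """Split one draft proposal into accepted and rejected tokens.

    Assumption: greedy verification, so the accepted part is the longest
    prefix of `draft` that matches `target` token by token.
    """
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return TraceSignals(accepted=draft[:n], rejected=draft[n:])


signals = split_trace(draft=[5, 7, 9, 2], target=[5, 7, 3, 2])
print(signals.accepted, signals.rejected)
```

In this toy case the first two draft tokens are accepted and the remaining two are discarded, even though the last one happens to match the verifier's token at the same position: once a mismatch occurs, everything after it was conditioned on a wrong prefix.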

Paper Structure

This paper contains 33 sections, 7 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Aurora. A unified training–serving framework for online speculative training with asynchronous, RL-style updates. A production inference server performs speculative decoding with a fixed target (verifier) and a lightweight draft model (speculator), accepting or rejecting proposed tokens during verification. Serving traces—including both accepted and rejected prefixes—are streamed into a data buffer and training pipeline. A separate training server continuously updates the speculator from collected off-policy data and periodically hot-swaps asynchronous model updates into the inference server without interrupting requests. Bottom: (left) per-request throughput over time, exhibiting step changes after each asynchronous update; (right) acceptance-length statistics during continuous training, showing improving (or sustained) acceptance over time.
  • Figure 2: Illustration of the Tree Attention mechanism. It enables efficient batched computation over the entire speculative tree, including both accepted (green) and rejected (red) tokens.
  • Figure 3: Mixed streams. Day-0 adaptation of an untrained speculator. (a) The acceptance length starts at one and rapidly increases, converging with the pretrained baseline. (b) The per-request throughput, defined as $(T_{\text{input}} + T_{\text{output}}) / t_{\text{request}}$ where $T_{\text{input}}$ and $T_{\text{output}}$ are the input and output token counts and $t_{\text{request}}$ is the end-to-end latency, initially suffers but recovers as the speculator adapts, demonstrating the effectiveness of the serve-to-train flywheel. (c) Continuing fine-tuning on top of the trained model achieves even better results.
  • Figure 4: Ordered streams. Day-0 adaptation of an untrained speculator. (a) The acceptance length starts at one and rapidly increases, converging to and sometimes even surpassing the pretrained baseline. (b) The throughput (see definition in Section \ref{sub:infra}) initially suffers but recovers as the speculator adapts, demonstrating the effectiveness of the serve-to-train flywheel. (c) Continued fine-tuning on top of the trained model drops at first but achieves better results after some training.
  • Figure 5: A Study of Speculator Asynchronization Policy. More frequent policy refresh improves post-shift adaptation (higher acceptance length) but can reduce serving throughput due to synchronization overhead. A moderately lazy schedule (e.g., Trained w Async 48) provides a strong Pareto point, preserving throughput while retaining most of the adaptation benefit.
  • ...and 5 more figures
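
The per-request throughput defined in the Figure 3 caption, $(T_{\text{input}} + T_{\text{output}}) / t_{\text{request}}$, can be computed as in the minimal sketch below. The function name, the choice of seconds as the latency unit, and the example numbers are assumptions made for illustration; they do not come from the paper.

```python
def per_request_throughput(input_tokens: int, output_tokens: int, latency_s: float) -> float:
    """Tokens per second for one request: (T_input + T_output) / t_request.

    `latency_s` is the end-to-end request latency; seconds are an assumed unit.
    """
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return (input_tokens + output_tokens) / latency_s


# Hypothetical request: 100 prompt tokens, 400 generated tokens, 2.0 s end to end.
print(per_request_throughput(100, 400, 2.0))  # 250.0
```

Because the numerator counts input tokens as well as output tokens, requests with long prompts report higher values under this metric even when decoding speed is unchanged.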