DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon

TL;DR

This paper tackles latency in real-world large-batch LLM serving by moving beyond fixed speculation lengths. It introduces Dynamic Speculative Decoding Engine (DSDE), a training-free framework that relies on post-hoc signals derived from the variance of the Kullback–Leibler divergence, $D_{KL}$, to gauge regional stability and guide per-sequence speculation length with an adaptive cap $SL_{cap}$. The approach is integrated into vLLM and demonstrated across diverse model pairs and datasets, achieving latency on par with static-opt baselines and AdaEDL while offering superior robustness, particularly in high-divergence, low-acceptance-rate regimes. The results substantiate post-hoc stability signals as a practical component for robust, efficient LLM inference in real-world serving scenarios, and point to future work on richer feature sets and integration with advanced execution graphs.

Abstract

Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
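To make the first component concrete, the following is a minimal sketch of how a variance-of-KLD signal could be tracked per sequence. It is an illustration under stated assumptions, not the paper's implementation: the class name `KLDStabilityTracker`, the window size, the direction of the divergence (target relative to draft), and the use of a plain population variance are all hypothetical choices made here for clarity.

```python
import math
from collections import deque
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


class KLDStabilityTracker:
    """Hypothetical helper: keeps recent KLD values for one sequence."""

    def __init__(self, window=8):
        # The window size is an assumption, not a value taken from the paper.
        self.history = deque(maxlen=window)

    def update(self, target_probs, draft_probs):
        # Record the divergence observed at one verified token position.
        self.history.append(kl_divergence(target_probs, draft_probs))

    def variance(self):
        # Low variance suggests a locally stable region of generation;
        # fewer than two samples carry no variance information.
        if len(self.history) < 2:
            return 0.0
        return pvariance(self.history)


tracker = KLDStabilityTracker(window=4)
tracker.update([0.5, 0.5], [0.5, 0.5])  # draft agrees with target
tracker.update([1.0, 0.0], [0.5, 0.5])  # draft diverges from target
print(tracker.variance())
```

A scheduler could then lengthen speculation while the variance stays low and shorten it when the variance rises. How that mapping is defined, and how the adaptive cap is applied, is specified by the paper rather than by this sketch.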

Paper Structure

This paper contains 20 sections, 11 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Two strategies for speculative decoding: per-batch decoding with a static speculation length (top, $SL=3$) and per-sequence decoding with adaptive speculation lengths (bottom).
  • Figure 2: Iteration-level fluctuation of the optimal speculation length, highlighting the challenge for dynamic prediction.
  • Figure 3: Straggler problem in a per-sequence approach.
  • Figure 4: Workflow of the dynamic Speculative Length (SL) system. After the standard speculative decoding steps involving the Draft worker (1), Target worker (2), and Rejection sampler (3), the SL adapter (4) computes the next SL, which informs the Lookahead scheduler for the subsequent decoding round.
  • Figure 5: An illustration of the data collection process for calculating the WVIR. At each step $i$, the KLD values from the previous verification steps are aggregated to form distinct short-term and long-term historical windows.
  • ...and 4 more figures