DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon
TL;DR
This paper tackles latency in real-world large-batch LLM serving by moving beyond fixed speculation lengths. It introduces the Dynamic Speculative Decoding Engine (DSDE), a training-free framework that uses post-hoc signals derived from the variance of the Kullback–Leibler divergence, $D_{KL}$, to gauge regional stability and to guide the per-sequence speculation length, which is bounded by an adaptive cap $SL_{cap}$. The approach is integrated into vLLM and evaluated across diverse model pairs and datasets. It achieves latency on par with static-opt baselines and AdaEDL while offering superior robustness, particularly in high-divergence, low-acceptance-rate regimes. The results support post-hoc stability signals as a practical component for robust, efficient LLM inference in real-world serving scenarios, and point to future work on richer feature sets and integration with advanced execution graphs.
Abstract
Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose the Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
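
To make the two components concrete, the following is a minimal illustrative sketch, not the DSDE algorithm itself. The abstract does not give formulas, so everything here is an assumption chosen for illustration: the KLD direction (target relative to draft), the sliding window, the mapping from variance to speculation length, the batch-mean cap rule, and all names (`kl_divergence`, `speculation_length`, `apply_cap`, `sl_min`, `sl_max`, `window`, `scale`).

```python
import math
from statistics import mean, pvariance


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary.

    Assumption: p is the target model's distribution, q the draft model's.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def speculation_length(kld_history, sl_min=1, sl_max=8, window=8, scale=1.0):
    """Map the variance of recent KLD values to a speculation length.

    Hypothetical rule: low variance (a stable region) gives a longer
    speculation length, high variance gives a shorter one.
    """
    recent = kld_history[-window:]
    if len(recent) < 2:
        return sl_min  # not enough post-hoc evidence yet, stay conservative
    stability = 1.0 / (1.0 + scale * pvariance(recent))  # value in (0, 1]
    proposed = round(sl_min + stability * (sl_max - sl_min))
    return max(sl_min, min(sl_max, proposed))


def apply_cap(lengths, sl_max=8):
    """Clip per-sequence lengths to a batch-level cap.

    Hypothetical cap rule: the batch mean, rounded up. This keeps one
    sequence with a very long speculation length from stalling the batch.
    """
    cap = min(sl_max, math.ceil(mean(lengths)))
    return [min(length, cap) for length in lengths], cap


# Usage: one stable and one volatile sequence in the same batch.
stable_history = [0.2, 0.2, 0.2, 0.2]
volatile_history = [0.1, 5.0, 0.2, 6.0]
proposed = [speculation_length(h) for h in (stable_history, volatile_history)]
capped, cap = apply_cap(proposed)
print(proposed, capped, cap)
```

In this sketch the signal is post-hoc in the sense that it only uses KLD values from tokens that have already been verified, so it needs no training and no extra forward passes. The cap is applied after the per-sequence lengths are chosen, which is one plausible way to address the straggler problem the abstract mentions.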
