AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving
Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi
TL;DR
AdaSpec addresses the challenge of meeting SLOs for cloud LLM inference under dynamic workloads by making speculative decoding adaptive. It introduces an efficiency model and three modules (an adaptive drafter, a confidence prior verifier, and an SLO-aware efficiency estimator) that adjust the speculative length at both the batch and per-request levels. Experiments on real-world traces show up to 66% speedup over state-of-the-art speculative inference systems while maintaining high SLO attainment, and ablations confirm the value of fine-grained control. The approach advances practical LLM serving by balancing throughput and reliability across diverse hardware and workload patterns.
Abstract
Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to fluctuating workloads and dynamic system environments, resulting in impaired performance and SLO violations. In this paper, we introduce AdaSpec, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. AdaSpec proposes a theoretical model to analyze and predict the efficiency of speculative strategies across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to maximize performance while ensuring high SLO attainment. Experimental results on real-world LLM service traces demonstrate that AdaSpec consistently meets SLOs and achieves substantial performance improvements, delivering up to 66% speedup compared to state-of-the-art speculative inference systems. The source code is publicly available at https://github.com/cerebellumking/AdaSpec
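To make the trade-off behind adaptive speculative length concrete, the sketch below picks a draft length under a per-step latency budget. It is not AdaSpec's efficiency model. It assumes the textbook simplification that each drafted token is accepted independently with probability alpha, that drafting cost is linear in the draft length, and that verification cost is constant. The names expected_tokens, step_latency, choose_speculative_length, and all parameters are hypothetical and chosen only for illustration.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k drafted tokens.

    Assumption: each drafted token is accepted independently with
    probability alpha, and the verifier always contributes one token,
    giving the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_latency(k: int, draft_cost: float, verify_cost: float) -> float:
    """Latency of one draft-then-verify step.

    Assumption: k sequential draft passes plus one verification pass
    whose cost does not depend on k.
    """
    return k * draft_cost + verify_cost


def choose_speculative_length(alpha, draft_cost, verify_cost, max_k,
                              latency_budget=None):
    """Return the k in [0, max_k] with the highest expected tokens per
    unit time whose step latency fits within latency_budget (if given).

    k = 0 (plain autoregressive decoding) is always kept as the
    fallback, even when it exceeds the budget on its own.
    """
    best_k = 0
    best_rate = expected_tokens(alpha, 0) / step_latency(0, draft_cost, verify_cost)
    for k in range(1, max_k + 1):
        latency = step_latency(k, draft_cost, verify_cost)
        if latency_budget is not None and latency > latency_budget:
            break  # latency only grows with k, so longer drafts also violate the budget
        rate = expected_tokens(alpha, k) / latency
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k
```

In this toy model, a low acceptance rate or a tight latency budget pushes the chosen length toward zero, while a high acceptance rate and a cheap drafter favor longer drafts. A serving system would additionally have to account for batch size and load, which this sketch ignores.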
