AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving

Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi

TL;DR

AdaSpec tackles the challenge of meeting SLOs for cloud LLM inference under dynamic workloads by making speculative decoding adaptive. It introduces an efficiency model and three modules—adaptive drafter, confidence prior verifier, and SLO-aware efficiency estimator—to adjust speculative length at both batch and per-request levels. Empirical results on real-world traces show up to 66% speedup over state-of-the-art speculative systems while maintaining high SLO attainment, and ablation confirms the value of fine-grained control. The approach advances practical LLM serving by balancing throughput and reliability across diverse hardware and workload patterns.

Abstract

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to fluctuating workloads and dynamic system environments, resulting in impaired performance and SLO violations. In this paper, we introduce AdaSpec, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. AdaSpec proposes a theoretical model to analyze and predict the efficiency of speculative strategies across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to maximize performance while ensuring high SLO attainment. Experimental results on real-world LLM service traces demonstrate that AdaSpec consistently meets SLOs and achieves substantial performance improvements, delivering up to 66% speedup compared to state-of-the-art speculative inference systems. The source code is publicly available at https://github.com/cerebellumking/AdaSpec

Paper Structure

This paper contains 23 sections, 4 theorems, 10 equations, 13 figures, 2 tables, 3 algorithms.

Key Result

lemma 1

The total speculative decoding time $T_{sd}$ is a quadratic function of the speculative length $SL$.
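
The lemma's equation is not reproduced in this summary, and the sketch below does not try to restate it. As a rough illustration of why a quadratic step time matters when choosing $SL$, it pairs a placeholder quadratic cost (the coefficients `a`, `b`, `c` are hypothetical, not the paper's) with the standard expected-token count under an assumed independent per-token acceptance rate `alpha`, then picks the length with the best tokens per unit time. This is not AdaSpec's efficiency model, only a toy under those assumptions.

```python
def step_time(sl: int, a: float, b: float, c: float) -> float:
    """Hypothetical quadratic cost of one speculative step: a*SL^2 + b*SL + c."""
    return a * sl * sl + b * sl + c


def expected_tokens(sl: int, alpha: float) -> float:
    """Expected tokens emitted per step when each draft token is accepted
    independently with probability alpha (an assumption, not from the paper).
    The target model always contributes one token, hence the SL + 1 terms."""
    if alpha >= 1.0:
        return sl + 1.0
    return (1.0 - alpha ** (sl + 1)) / (1.0 - alpha)


def best_speculative_length(a: float, b: float, c: float,
                            alpha: float, max_sl: int = 8) -> int:
    """Brute-force search for the SL in [0, max_sl] that maximizes
    expected tokens per unit time under the toy model above."""
    return max(
        range(max_sl + 1),
        key=lambda sl: expected_tokens(sl, alpha) / step_time(sl, a, b, c),
    )
```

With `alpha = 0` no draft token is ever accepted, so any positive cost coefficients make `best_speculative_length` return 0. As `alpha` approaches 1 and the fixed cost `c` dominates, longer drafts win.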

Figures (13)

  • Figure 1: Different request patterns in real production traces.
  • Figure 2: A single-step speculative decoding process with a speculative length of 1. We primarily discuss greedy sampling, where the token with the highest confidence score is selected as the output instead of the whole distribution.
  • Figure 3: Speedup across various speculative lengths under diverse request quantities and types. The red dashed line represents the speedup value of 1.
  • Figure 4: Speedup across different speculative lengths for various draft-target model pairs on diverse computing platforms.
  • Figure 5: Relationship between SLO attainment and end-to-end latency speedup under different speculative lengths. The red dashed line represents the speedup value of 1 and the SLO attainment value of 0.9.
  • ...and 8 more figures
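
The single-step process described in the Figure 2 caption can be sketched as follows. This is a minimal sketch of generic greedy speculative decoding, not AdaSpec's implementation: the function names, the toy models, and the per-position verification loop are all assumptions made for illustration (a real system would verify all drafted positions in one forward pass of the target model).

```python
from typing import Callable, List

# A greedy "model": maps a token context to its single most likely next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, speculative_length: int) -> List[int]:
    """Run one draft-then-verify step and return the newly emitted tokens."""
    # Draft phase: the lightweight model proposes `speculative_length` tokens.
    drafted: List[int] = []
    ctx = list(prefix)
    for _ in range(speculative_length):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: keep drafted tokens while they match the target's greedy choice.
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            emitted.append(expected)  # first mismatch: emit the target's token and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the target contributes one extra token.
    emitted.append(target_next(ctx))
    return emitted


def toy_target(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

Under greedy sampling the emitted tokens always equal what the target model alone would have produced; the draft model only changes how many tokens come out of each step, which is why the choice of speculative length affects speed but not output.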

Theorems & Definitions (8)

  • lemma 1
  • proof
  • lemma 2
  • proof
  • theorem 1
  • proof
  • corollary 1
  • proof