Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Ruixiao Li, Fahao Chen, Peng Li

TL;DR

The paper tackles latency minimization for LLM inference with speculative decoding, where a request's execution time depends on both its final output length and the acceptance rate of its speculative tokens. It introduces LAPS-SD, a semi-clairvoyant scheduler that uses multiple priority queues to preempt requests while their acceptance rates are still uncertain (LAS-style) and then transitions to an SJF-like schedule once acceptance rates stabilize. The later phase relies on predictions of output length $L_i$ and acceptance rate $A_i$ and on the execution-time estimate $\tilde{T}_i = \frac{n L_i T_{SSM}}{n A_i + 1} + \frac{L_i T_{LLM}}{n A_i + 1}$. The approach balances preemption overhead against estimation accuracy, delivering about a 39% reduction in average inference latency across three datasets with a manageable estimation error (overall $6.84\%$). Empirical results on a modern GPU platform validate the inter- and intra-queue design, the stability-based perceptible state, and the practical benefits for speculative decoding in LLM serving systems.
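The execution-time estimate above can be evaluated directly. The following is a minimal sketch, not the authors' implementation. It assumes that $n$ is the number of speculative tokens drafted per decoding step, that $T_{SSM}$ is the time for the SSM to draft one token, and that $T_{LLM}$ is the time for one LLM verification pass; the function and parameter names are hypothetical.

```python
def estimate_execution_time(output_len, accept_rate, n_spec, t_ssm, t_llm):
    """Evaluate T~_i = (n * L_i * T_SSM + L_i * T_LLM) / (n * A_i + 1).

    Assumed reading of the symbols (not confirmed by the summary):
      output_len  -- predicted output length L_i, in tokens
      accept_rate -- predicted token acceptance rate A_i, in [0, 1]
      n_spec      -- speculative tokens drafted per step, n
      t_ssm       -- time for the SSM to draft one token, T_SSM
      t_llm       -- time for one LLM verification pass, T_LLM
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in [0, 1]")
    # Expected tokens produced per step: accepted drafts plus one LLM token.
    tokens_per_step = n_spec * accept_rate + 1
    # Estimated number of draft-and-verify steps needed for L_i tokens.
    steps = output_len / tokens_per_step
    # Each step costs n draft tokens plus one verification pass.
    return steps * (n_spec * t_ssm + t_llm)


# Same predicted length, different acceptance rates (times in arbitrary units).
slow = estimate_execution_time(90, 0.5, 4, 1.0, 10.0)
fast = estimate_execution_time(90, 1.0, 4, 1.0, 10.0)
print(slow, fast)
```

Under these assumptions, two requests with the same predicted output length receive different estimates when their acceptance rates differ, which is why a length-only predictor can rank them incorrectly.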

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
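The two scheduling regimes described in the abstract can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the LAPS-SD algorithm itself: the queue thresholds, the `Request` fields, and the intra-queue ordering (requests with a known estimate first, shortest remaining time first) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    name: str
    attained: float = 0.0                  # service received so far
    est_remaining: Optional[float] = None  # set once the acceptance rate is judged stable


def queue_index(attained: float, thresholds: List[float]) -> int:
    """LAS-style placement: more attained service means a lower-priority queue."""
    for i, limit in enumerate(thresholds):
        if attained < limit:
            return i
    return len(thresholds)


def pick_next(requests: List[Request], thresholds: List[float]) -> Request:
    """Choose the next request to run (hypothetical policy for illustration)."""
    def key(r: Request):
        q = queue_index(r.attained, thresholds)
        if r.est_remaining is not None:
            # Estimate available: order SJF-like by estimated remaining time.
            return (q, 0, r.est_remaining)
        # No reliable estimate yet: fall back to least attained service.
        return (q, 1, r.attained)
    return min(requests, key=key)


thresholds = [10.0, 100.0]  # hypothetical queue boundaries
pending = [
    Request("a", attained=5.0),
    Request("b", attained=50.0, est_remaining=10.0),
    Request("c", attained=2.0),
]
first = pick_next(pending, thresholds)

pending.append(Request("d", attained=3.0, est_remaining=7.0))
second = pick_next(pending, thresholds)
print(first.name, second.name)
```

In this sketch a request that has consumed more service is demoted to a lower-priority queue, so a newly arrived request can preempt it, while a request with a stable estimate is ordered by its estimated remaining time within its queue.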

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of different scheduling algorithms for speculative decoding requests. The generation context is represented by solid colored squares ($\blacksquare$), while the speculative context is represented by striped squares.
  • Figure 2: The ratio of switching costs to the inference time of requests with different output lengths.
  • Figure 3: The average acceptance rate of three example requests over the speculative decoding process.
  • Figure 4: The queue structure in the proposed scheduling algorithm.
  • Figure 5: The average inference latency with different scheduling algorithms.
  • ...and 2 more figures