Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Ruixiao Li, Fahao Chen, Peng Li
TL;DR
The paper tackles latency minimization for LLM inference with speculative decoding, where a request's execution time depends on both its final output length and the acceptance rate of its speculative tokens. It introduces LAPS-SD, a semi-clairvoyant scheduler that maintains multiple priority queues and preempts requests across them (LAS-style) while their acceptance rates are still uncertain, then switches to an SJF-like schedule once acceptance rates stabilize. The later phase relies on predictions of output length $L_i$ and acceptance rate $A_i$, combined into the execution-time estimate $\tilde{T}_i = \frac{n L_i T_{SSM}}{n A_i + 1} + \frac{L_i T_{LLM}}{n A_i + 1}$. The approach balances preemption overhead against estimation accuracy, delivering about a 39% reduction in average inference latency across three datasets with a manageable estimation error (overall $6.84\%$). Empirical results on a modern GPU platform validate the inter- and intra-queue design, the stability-based perceptible state, and the practical benefits for speculative decoding in LLM serving systems.
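To make the execution-time estimate concrete, the following minimal Python sketch evaluates $\tilde{T}_i$ for a single request. It assumes, since the summary does not define them, that $n$ is the number of tokens the SSM drafts per round, that $T_{SSM}$ is the SSM time per drafted token, and that $T_{LLM}$ is the LLM time per verification pass. The function and parameter names are illustrative and do not come from the paper.

```python
def estimate_execution_time(output_len, accept_rate, n_spec, t_ssm, t_llm):
    """Evaluate T~_i = n*L*T_SSM / (n*A + 1) + L*T_LLM / (n*A + 1).

    Assumed meanings (not stated in the summary):
      output_len  -- predicted output length L_i, in tokens
      accept_rate -- predicted acceptance rate A_i, in [0, 1]
      n_spec      -- n, tokens drafted by the SSM per round
      t_ssm       -- T_SSM, SSM time per drafted token
      t_llm       -- T_LLM, LLM time per verification pass
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in [0, 1]")
    # Both terms share the denominator n*A + 1, so the estimate factors into
    # (number of rounds) * (cost of one draft-and-verify round).
    tokens_per_round = n_spec * accept_rate + 1
    rounds = output_len / tokens_per_round
    return rounds * (n_spec * t_ssm + t_llm)


if __name__ == "__main__":
    for a in (0.0, 0.5, 1.0):
        print(a, estimate_execution_time(100, a, n_spec=4, t_ssm=1.0, t_llm=10.0))
```

With these illustrative numbers the estimate falls from 1400 time units at $A_i = 0$ to 280 at $A_i = 1$, which shows why a prediction based on output length alone can be far off.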
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a small speculative model (SSM) to generate multiple candidate tokens, which the LLM then verifies in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically have uncertain execution times, which makes it challenging to schedule them efficiently in these systems. Existing work estimates execution time based solely on predicted output length, which can be inaccurate because execution time depends on both the output length and the rate at which the LLM accepts speculative tokens during verification. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a set of inference requests, LAPS-SD minimizes average inference latency by adaptively scheduling requests according to their features during decoding. While the token acceptance rate is still changing and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows requests to be preempted across queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
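The two scheduling regimes described above can be illustrated with a toy priority rule. This is a sketch under stated assumptions, not the LAPS-SD algorithm itself: the abstract does not give the queue thresholds, the stability test, or the intra-queue ordering, so the doubling thresholds, the `est_remaining` field, and the tie-breaking below are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    req_id: int
    arrival: int                            # arrival order, used as a FIFO tie-breaker
    attained: float = 0.0                   # service time received so far
    est_remaining: Optional[float] = None   # set only once the acceptance rate looks stable


def queue_index(amount, base=1.0, num_queues=4):
    """Hypothetical rule: queue k holds amounts below base * 2**k."""
    for k in range(num_queues - 1):
        if amount < base * 2 ** k:
            return k
    return num_queues - 1


def priority_key(req):
    if req.est_remaining is None:
        # Acceptance rate still uncertain: least-attained-service style.
        # The more service a request has consumed, the lower its queue.
        return (queue_index(req.attained), 1, req.arrival)
    # Acceptance rate stable: shortest-job-first style, ranked by the estimate.
    return (queue_index(req.est_remaining), 0, req.est_remaining)


def pick_next(requests):
    """Return the request to run next (smallest key = highest priority)."""
    return min(requests, key=priority_key)


if __name__ == "__main__":
    pending = [
        Request(1, arrival=0, attained=5.0),                     # long-running, uncertain
        Request(2, arrival=1, attained=0.5),                     # new, uncertain
        Request(3, arrival=2, attained=3.0, est_remaining=0.8),  # stable, nearly done
    ]
    print(pick_next(pending).req_id)
```

In this toy setup the stable, nearly finished request is chosen first, the new uncertain request second, and the long-running uncertain request last, mirroring the intent of preempting uncertain requests early and favoring short jobs once estimates are available.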
