TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
TL;DR
TurboSpec is a closed-loop control system that automatically profiles the LLM serving environment and uses goodput as the guiding metric to adapt intra-request speculation and inter-request batching. Combining offline latency profiling with online feedback on token acceptance, it predicts the most beneficial speculation length at each generation step and integrates with vLLM to balance draft-model and target-model work. The approach improves performance across static and dynamic workloads, achieving up to several-fold latency reductions while remaining stable under high load and distribution shifts. The work also describes practical integration techniques, including KV-cache handling, CUDA Graph acceleration, and an offline-online adaptation pipeline, that support deployment in real-world LLM serving.
Abstract
Large Language Model (LLM) serving systems batch concurrent user requests to serve them efficiently. In real-world deployments, however, this inter-request parallelism is often limited by external factors such as low request rates or memory constraints. Recent work turns to intra-request parallelism, in the form of speculative decoding, as a solution. Unfortunately, the benefits of intra-request parallelism are often fragile: speculative decoding adds overhead, and speculated tokens may be rejected. We observe that speculative decoding can degrade LLM serving performance if it is added naively, without tuning to the incoming requests and the speculation method. To remove the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and uses a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts "goodput" (the amount of successfully generated tokens) for each candidate amount of intra-request parallelism and, at runtime, selects the one with the highest predicted goodput. We implement TurboSpec on vLLM, a real-world LLM serving system, and demonstrate its effectiveness across diverse workloads and hardware configurations, where it provides consistent performance improvements across all tested scenarios.
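To make the goodput-guided selection described above concrete, the following is a minimal Python sketch, not TurboSpec's actual implementation. It assumes that each speculated token is accepted independently with probability `alpha` (an estimate that would come from online feedback) and that step latency follows a simple linear model. The function names and the constants `draft_ms`, `base_ms`, and `per_token_ms` are hypothetical stand-ins for an offline latency profile.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per request in one step with k speculated tokens.

    Assumption: each speculated token is accepted independently with
    probability alpha, and one token is always produced by the target model.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_latency_ms(batch_size: int, k: int,
                    draft_ms: float = 1.0,
                    base_ms: float = 20.0,
                    per_token_ms: float = 0.05) -> float:
    """Hypothetical linear latency model standing in for an offline profile.

    k sequential draft steps, plus one target pass whose cost grows with
    the number of tokens verified across the batch.
    """
    return k * draft_ms + base_ms + per_token_ms * batch_size * (k + 1)


def best_speculation_length(alpha: float, batch_size: int, max_k: int = 8) -> int:
    """Return the speculation length with the highest predicted goodput."""
    def goodput(k: int) -> float:
        tokens = batch_size * expected_tokens(alpha, k)
        return tokens / step_latency_ms(batch_size, k)

    # k = 0 means "no speculation", so the controller can turn it off.
    return max(range(max_k + 1), key=goodput)


if __name__ == "__main__":
    for batch_size in (1, 32, 256):
        print(batch_size, best_speculation_length(alpha=0.9, batch_size=batch_size))
```

Under these assumed constants, the chosen speculation length shrinks as the batch grows, and falls to zero when tokens are never accepted, which mirrors the trade-off between intra-request and inter-request parallelism that the abstract describes.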
