TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput

Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

TL;DR

TurboSpec is a closed-loop control system that automatically profiles the LLM serving environment and uses goodput as the guiding metric to balance intra-request speculation against inter-request batching. It combines offline latency profiling with online feedback on token acceptance to predict the most beneficial speculation length at each generation step, and it is integrated into vLLM to balance draft-model and target-model work. The approach improves performance across static and dynamic workloads, reducing latency by up to several-fold while remaining stable under high load and distribution shifts. The work also describes practical integration techniques, including KV-cache handling, CUDA Graph acceleration, and an offline-online adaptation pipeline, for deployment in real-world LLM serving.
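The two inputs named above, an offline latency profile and online acceptance feedback, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the linear latency model, the use of batched token count as the predictor, the exponential moving average, and the names `fit_linear` and `AcceptanceTracker` are all hypothetical.

```python
from dataclasses import dataclass


def fit_linear(xs, ys):
    """Ordinary least squares fit of y ~ intercept + slope * x.

    Assumed use: xs are batched token counts measured offline and ys are
    the measured step latencies. The linear form is an assumption.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


@dataclass
class AcceptanceTracker:
    """Exponential moving average of the observed token acceptance rate."""

    rate: float = 0.5   # hypothetical starting estimate
    decay: float = 0.9  # hypothetical smoothing factor

    def update(self, accepted: int, proposed: int) -> float:
        # Skip steps where nothing was proposed; they carry no signal.
        if proposed > 0:
            observed = accepted / proposed
            self.rate = self.decay * self.rate + (1 - self.decay) * observed
        return self.rate


# Offline: fit latency from a few (made-up) profiling samples.
intercept, slope = fit_linear([1, 2, 3, 4], [12.0, 14.0, 16.0, 18.0])

# Online: fold the outcome of one verification step into the estimate.
tracker = AcceptanceTracker()
tracker.update(accepted=3, proposed=4)
```

The fitted latency model and the running acceptance estimate are the kind of quantities a controller could then combine when choosing a speculation length.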

Abstract

Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent works focus on intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, the benefits of intra-request parallelism are often fragile, as speculative decoding adds overhead and speculated tokens may be rejected. We observe that speculative decoding may degrade LLM serving performance if added naively, without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and uses a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts "goodput" (the amount of successfully generated tokens) and, at runtime, sets the amount of intra-request parallelism to the value with the highest predicted goodput. We implement TurboSpec on vLLM, a real-world LLM serving system, and demonstrate its effectiveness across diverse workloads and hardware configurations, with consistent performance improvements across all test scenarios.
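To make the goodput-driven selection concrete, here is a small sketch under stated assumptions. It treats goodput as expected generated tokens per unit of step latency, assumes each proposed token is accepted independently with probability `alpha` (the standard simplification in the speculative decoding literature), and uses a made-up latency model with hypothetical constants. The function names are illustrative and are not taken from TurboSpec or vLLM.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per request in one step.

    Assumes k proposed tokens, each accepted independently with
    probability alpha, plus the one token the target model always yields.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def step_latency_ms(batch_size: int, k: int) -> float:
    """Hypothetical latency model; all constants are made up.

    k draft passes, then one target pass over batch_size * (k + 1) tokens.
    """
    draft_ms_per_pass = 1.0
    target_base_ms = 20.0
    target_ms_per_token = 0.05
    return (k * draft_ms_per_pass
            + target_base_ms
            + target_ms_per_token * batch_size * (k + 1))


def best_proposed_length(alpha: float, batch_size: int, max_k: int = 8) -> int:
    """Pick the proposed length (0 = no speculation) with highest goodput."""
    def goodput(k: int) -> float:
        tokens = batch_size * expected_tokens(alpha, k)
        return tokens / step_latency_ms(batch_size, k)

    return max(range(max_k + 1), key=goodput)


# Small batch, high acceptance: long speculation pays off.
low_load = best_proposed_length(alpha=0.9, batch_size=1)
# Large batch, low acceptance: speculation is switched off (length 0).
high_load = best_proposed_length(alpha=0.3, batch_size=256)
```

Under this toy model the chosen length shrinks as the batch grows or the acceptance rate falls, which mirrors the abstract's point that naive, fixed speculation can hurt performance.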

Paper Structure (56 sections, 8 equations, 34 figures, 10 tables)

Figures (34)

  • Figure 1: Comparison of (a) inter-request parallelism from batching, limited by external factors such as the number of requests or VRAM, and (b) intra-request parallelism from speculative decoding, limited by stochastic factors such as the speculation method or the speculation difficulty of the request. TurboSpec evaluates the limitations of (a) and (b) and adaptively combines them to efficiently leverage the larger product space of both parallelisms, as shown in (c).
  • Figure 2: Single-step execution latency using vLLM [vllm2024serving] with Llama-2 7B as the target model and Llama-2 160M as the draft model on an H100 GPU. The blue triangle represents the latency without speculative decoding. Latency curves are fitted to the sample points, which are shown as dots.
  • Figure 3: Two generation steps of draft-model-based TurboSpec execution. The tokens proposed by the draft model are sent to the target model for scoring in a single forward pass, allowing more than one token to be generated for each request. The TurboSpec online adaptor uses goodput to adjust the proposed length and verification length dynamically at each step.
  • Figure 4: Example of different proposed and verification lengths in prompt lookup decoding across multiple requests. $x_{10}, x_{20}, x_{30}$ are input tokens for requests $R_1, R_2, R_3$ respectively. Successful matches are identified only for requests $R_1$ and $R_3$, resulting in proposals for these requests.
  • Figure 5: Normalized goodput across proposed lengths. We normalize goodput by expressing it as a fraction of the maximum goodput achieved across all proposed lengths in the same configuration (fixed batch size or acceptance rate).
  • ...and 29 more figures
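
The caption of Figure 4 notes that prompt lookup decoding can produce a different proposed length for each request, including none at all. A minimal sketch of that proposal step is below. It assumes the common formulation in which the last few tokens of a request are matched against an earlier position in the same sequence and the tokens that followed the match are proposed. The function name, parameters, and token values are hypothetical and do not come from the paper.

```python
def prompt_lookup_propose(tokens, ngram_size=2, max_proposed=3):
    """Propose a continuation by n-gram lookup within the request itself.

    Finds the most recent earlier occurrence of the final `ngram_size`
    tokens and returns up to `max_proposed` tokens that followed it.
    Returns an empty list when there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return tokens[follow:follow + max_proposed]
    return []


# Three requests with made-up token IDs.
requests = {
    "R1": [1, 2, 3, 4, 1, 2],  # suffix [1, 2] seen earlier
    "R2": [5, 6, 7, 8],        # suffix [7, 8] never seen earlier
    "R3": [9, 9, 9],           # suffix [9, 9] overlaps an earlier match
}
proposals = {name: prompt_lookup_propose(toks) for name, toks in requests.items()}
```

In this example R1 and R3 receive proposals of different lengths while R2 receives none, so the verification work in the batch is uneven across requests.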