
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Jingyu Liu, Beidi Chen, Ce Zhang

TL;DR

SpecPrefill addresses the TTFT bottleneck in LLM inference by using a lightweight, training-free speculator to drop non-essential prompt tokens during prefill. The method estimates token importance via look-ahead attention aggregation, chunk-based denoising, and position-id restoration, and can be paired with speculative decoding for further gains. Across long-context benchmarks (LongBench) and synthetic tasks (RULER), as well as standard short-task evaluations, SpecPrefill demonstrates substantial TTFT and QPS improvements with modest accuracy loss and no fine-tuning. The approach is compatible with existing serving stacks (e.g., vLLM) and can be combined with quantization and KV-cache strategies to enable practical, large-scale LLM serving for applications requiring fast responses and long contexts.

Abstract

Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, reducing TTFT is notoriously challenging because prefill is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates inference TTFT for both long- and medium-context queries based on the following insight: LLMs generalize well enough to preserve quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill on a diverse set of tasks, followed by comprehensive benchmarking of the performance improvement in both a real end-to-end setting and ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7$\times$ maximal end-to-end QPS on real downstream tasks and 7.66$\times$ TTFT improvement.
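The selection step described above (keep a subset of prompt tokens and pass their original positions to the main model) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_tokens`, the mean-per-chunk scoring rule, and the assumption that per-token importance scores have already been produced by the lightweight model are all hypothetical choices made for illustration.

```python
import math

def select_tokens(scores, keep_rate, chunk_size):
    """Hypothetical helper: return the original positions of the tokens to keep.

    scores     -- per-token importance scores (assumed to come from the speculator)
    keep_rate  -- fraction of chunks to keep, in (0, 1]
    chunk_size -- number of contiguous tokens scored and kept together
    """
    n = len(scores)
    # Split the prompt into contiguous chunks; the last chunk may be shorter.
    chunks = [range(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    # Assumption: a chunk's score is the mean of its token scores.
    chunk_scores = [sum(scores[i] for i in c) / len(c) for c in chunks]
    n_keep = max(1, math.ceil(keep_rate * len(chunks)))
    # Rank chunks by score, keep the best ones, then restore prompt order.
    ranked = sorted(range(len(chunks)), key=lambda c: chunk_scores[c], reverse=True)
    kept_chunks = sorted(ranked[:n_keep])
    # The returned indices double as position ids, so kept tokens
    # retain their original positions in the prompt.
    return [i for c in kept_chunks for i in chunks[c]]

scores = [0.1, 0.1, 0.9, 0.8, 0.0, 0.1, 0.7, 0.6]
position_ids = select_tokens(scores, keep_rate=0.5, chunk_size=2)
print(position_ids)  # [2, 3, 6, 7]
```

In this sketch, the kept token ids would be `[prompt_ids[i] for i in position_ids]`, and `position_ids` would be supplied to the main model in place of the default `0..len-1` positions.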

Paper Structure

This paper contains 32 sections, 3 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Speculative Prefill QPS Improvement: In an end-to-end server-client setting with real-world datasets, we benchmark the average query latency under a fixed timeout when sending queries at a constant QPS. SpecPrefill significantly improves both the maximum QPS supported by the vLLM server and the latency compared to not using it. At low keep rates, the 405B model served with SpecPrefill can even run more efficiently than the 70B model. As the base model size increases and the keep rate drops, we can get a 7$\times$ end-to-end QPS boost while incurring only $<5\%$ accuracy loss.
  • Figure 2: LongBench Main Result on Llama 405B: In this figure, we showcase the effectiveness of SpecPrefill on LongBench, which consists of six categories of long-context downstream tasks. In each plot, the dashed lines are the results of the baseline Llama-3.1-405B-Instruct-FP8 for each subtask, and we benchmark SpecPrefill with increasing token keep rates. We observe different behaviors, such as quality preservation, degradation, and improvement, depending on the task type.
  • Figure 3: SpecPrefill TTFT Improvement: We present the prefill TTFT speed-up using SpecPrefill under different settings over Llama-3.1-70B-Instruct and Llama-3.1-405B-Instruct-FP8 (achieving up to 7.66$\times$ faster TTFT when keeping $10\%$ of tokens for the 405B model).
  • Figure 4: SpecPrefill with look-ahead TTFT Improvement: Complementary to Figure 3, we also show the relative speedup when using a look-ahead of 8 steps for both the 70B and 405B models.
  • Figure 5: SpecPrefill vs. MInference TTFT on 70B Models: The advantage of SpecPrefill becomes clearer as we increase the batch size under 128K context lengths, while MInference gradually improves as the context length increases with smaller batch sizes due to less overhead. Percentages in parentheses are the average scores relative to that of MInference on LongBench.
  • ...and 5 more figures