Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Jingyu Liu; Beidi Chen; Ce Zhang

TL;DR

SpecPrefill addresses the TTFT bottleneck in LLM inference by using a lightweight, training-free speculator to drop non-essential prompt tokens during prefill. The method estimates token importance via look-ahead attention aggregation, chunk-based denoising, and position-id restoration, and can be paired with speculative decoding for further gains. Across long-context benchmarks (LongBench) and synthetic tasks (RULER), as well as standard short-task evaluations, SpecPrefill demonstrates substantial TTFT and QPS improvements with modest accuracy loss and no fine-tuning. The approach is compatible with existing serving stacks (e.g., vLLM) and can be combined with quantization and KV-cache strategies to enable practical, large-scale LLM serving for applications requiring fast responses and long contexts.

Abstract

Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, reducing TTFT is notoriously challenging since it is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates inference TTFT for both long- and medium-context queries based on the following insight: LLMs generalize well enough to preserve output quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill on a diverse set of tasks, followed by comprehensive benchmarking of the performance improvement in both a real end-to-end setting and ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7$\times$ maximal end-to-end QPS on real downstream tasks and 7.66$\times$ TTFT improvement.
