TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

TL;DR

TokenFlow tackles real-time LLM text streaming under bursty demand by tightly integrating a buffer-aware, preemptive request scheduler with a proactive, hierarchical KV cache manager that moves data between GPU and CPU while overlapping I/O with computation. The approach optimizes a QoS objective that weighs token usefulness, startup latency, and playback stability, rather than chasing pure token throughput. Empirical results show up to 82.5% higher effective throughput and up to 80.2% lower P99 TTFT across diverse models and GPUs, with modest scheduling overhead and ablation evidence supporting the memory-management contributions. The work offers a practical, single-node solution with clear pathways to multi-node extensions, delivering more robust, user-centric streaming for real-time LLM applications.

Abstract

Real-time LLM interactions demand streamed token generation, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation (i.e., meeting the required time-between-tokens). Standard LLM serving systems suffer from the inflexibility caused by non-preemptive request scheduling and reactive memory management, leading to poor resource utilization and low request processing parallelism under request bursts. Therefore, we present TokenFlow, a novel LLM serving system with enhanced text streaming performance via preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
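
To make the buffer-aware prioritization concrete, the following is a minimal sketch, not TokenFlow's actual scheduler. It assumes that a request's urgency can be approximated by how long its client-side buffer would last at the user's consumption rate. The names `StreamRequest`, `seconds_until_stall`, and `pick_batch` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamRequest:
    """Hypothetical per-request state; not taken from the TokenFlow code base."""
    request_id: str
    buffered_tokens: int   # tokens generated but not yet consumed by the user
    consume_rate: float    # tokens per second the user reads or listens at

    def seconds_until_stall(self) -> float:
        """Time before the buffer runs dry if generation pauses now."""
        if self.consume_rate <= 0:
            return float("inf")
        return self.buffered_tokens / self.consume_rate


def pick_batch(requests, batch_size):
    """Split requests into (run, preempt) by urgency.

    Assumption: the requests closest to stalling run first; the rest are
    candidates for preemption, e.g. having their KV cache moved to CPU memory.
    """
    ranked = sorted(requests, key=lambda r: r.seconds_until_stall())
    return ranked[:batch_size], ranked[batch_size:]


requests = [
    StreamRequest("a", buffered_tokens=40, consume_rate=5.0),   # 8 s of slack
    StreamRequest("b", buffered_tokens=0, consume_rate=5.0),    # waiting for first token
    StreamRequest("c", buffered_tokens=100, consume_rate=4.0),  # 25 s of slack
]
run, preempt = pick_batch(requests, batch_size=2)
print([r.request_id for r in run], [r.request_id for r in preempt])
```

In this sketch a newly arrived request has an empty buffer and therefore the highest urgency, which is one simple way to favor low TTFT, while requests with a large buffer can be paused without the user noticing.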

Paper Structure

This paper contains 36 sections, 8 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: We summarize the token consumption speeds for reading (left) and for listening (right), measured across different age groups and language users. The data is derived from calculations based on reading speed data from NIH [liu2017age] and information on token counting from OpenAI's blog [openai_tokens].
  • Figure 2: Micro-benchmark of SGLang's burst request handling, conducted on a single NVIDIA H200 GPU. Left: Time-to-First-Token (TTFT) surges beyond the acceptable threshold (1.3s, red line) under increasing request intensity. Right: Generation speed declines but remains excessively high ($2\times$ average reading speed for reference, red line).
  • Figure 3: Overview of TokenFlow: Detailed breakdown of all modules and their components.
  • Figure 4: High-level workflow of TokenFlow. Modules newly added by TokenFlow are colored green.
  • Figure 5: Three QoS factors: (a) Startup latency, (b) User stall events, (c) Token usefulness.
  • ...and 18 more figures
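
The Figure 1 caption describes deriving token consumption speeds from reading-speed data and token-counting information. A minimal sketch of that kind of conversion is shown below. It assumes OpenAI's rule of thumb for English text (one token is roughly 0.75 words); the function name and the example speeds are hypothetical and are not values taken from the paper.

```python
# Assumption: OpenAI's rule of thumb for English, about 0.75 words per token.
WORDS_PER_TOKEN = 0.75


def tokens_per_second(words_per_minute: float,
                      words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a reading or listening speed in words/minute to tokens/second."""
    return words_per_minute / 60.0 / words_per_token


# Hypothetical example speeds, for illustration only.
for label, wpm in [("reading", 240.0), ("listening", 150.0)]:
    print(f"{label}: {tokens_per_second(wpm):.2f} tokens/s")
```

The ratio varies by language and tokenizer, so a different `words_per_token` would be needed for non-English users.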