TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Junyi Chen; Chuheng Du; Renyuan Liu; Shuochao Yao; Dingtian Yan; Jiang Liao; Shengzhong Liu; Fan Wu; Guihai Chen

TL;DR

TokenFlow tackles real-time LLM text streaming under bursty demand by tightly integrating a buffer-aware, preemptive request scheduler with a proactive, hierarchical KV cache manager that moves data between GPU and CPU memory while overlapping I/O with computation. The approach optimizes a QoS objective that weighs token usefulness, startup latency, and playback stability, rather than chasing raw token throughput. Empirical results show up to 82.5% higher effective throughput and up to 80.2% lower P99 TTFT across diverse models and GPUs, with modest scheduling overhead and ablation evidence supporting the memory-management contributions. The work offers a practical single-node solution with clear pathways to multi-node extensions, delivering more robust, user-centric streaming for real-time LLM applications.

Abstract

Real-time LLM interactions demand streamed token generation, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token, TTFT) and steady generation (i.e., meeting the required time-between-tokens). Standard LLM serving systems are inflexible because they rely on non-preemptive request scheduling and reactive memory management, which leads to poor resource utilization and low request processing parallelism under request bursts. We therefore present TokenFlow, a novel LLM serving system that improves text streaming performance through preemptive request scheduling and proactive key-value (KV) cache management. TokenFlow dynamically prioritizes requests based on real-time token buffer occupancy and token consumption rate, while actively transferring KV cache between GPU and CPU memory in the background and overlapping I/O with computation to minimize request preemption overhead. Extensive experiments on Llama3-8B and Qwen2.5-32B across multiple GPUs (RTX 4090, A6000, H200) demonstrate that TokenFlow achieves up to 82.5% higher effective throughput (accounting for actual user consumption) while reducing P99 TTFT by up to 80.2%, without degrading overall token throughput.
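To make the buffer-aware prioritization idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each request tracks how many generated tokens the client has not yet consumed and how fast the client consumes them, and it ranks requests by how soon their buffer would run dry. The names `StreamRequest`, `seconds_until_stall`, and `pick_running_set`, as well as the simple "smallest slack first" rule, are hypothetical and chosen only for illustration; the abstract does not specify the actual priority function.

```python
from dataclasses import dataclass


@dataclass
class StreamRequest:
    # All fields are illustrative assumptions, not TokenFlow's data model.
    request_id: str
    buffered_tokens: int   # generated but not yet consumed by the client
    consume_rate: float    # tokens per second the client reads


def seconds_until_stall(req: StreamRequest) -> float:
    """Time until the client's token buffer drains if generation pauses."""
    if req.consume_rate <= 0:
        return float("inf")  # a client that is not reading cannot stall
    return req.buffered_tokens / req.consume_rate


def pick_running_set(requests, max_running):
    """Keep the requests closest to stalling; the rest become preemption candidates."""
    ranked = sorted(requests, key=seconds_until_stall)
    return ranked[:max_running], ranked[max_running:]


if __name__ == "__main__":
    reqs = [
        StreamRequest("a", buffered_tokens=40, consume_rate=10.0),
        StreamRequest("b", buffered_tokens=0, consume_rate=10.0),
        StreamRequest("c", buffered_tokens=5, consume_rate=10.0),
    ]
    running, preempted = pick_running_set(reqs, max_running=2)
    print([r.request_id for r in running], [r.request_id for r in preempted])
```

In this toy example, request `a` holds four seconds of buffered text, so pausing it is less likely to be noticed by its user than pausing `b` or `c`. A real system would also have to weigh the cost of moving the paused request's KV cache, which the abstract says TokenFlow hides by overlapping transfers with computation.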
