Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

Jian Tian; Shuailong Li; Yang Cao; Wenbo Cui; Minghan Zhu; Wenkang Wu; Jianming Zhang; Yanpeng Wang; Zhiwen Xiao; Zhenyu Hou; Dou Shen

TL;DR

The paper tackles scheduling inefficiencies in large-scale DP+EP LLM inference, where dispatching each request immediately causes device-side queuing and head-of-line (HOL) blocking. It introduces Staggered Batch Scheduling (SBS), which buffers incoming requests and releases them as near-optimal execution batches, giving the scheduler a global view for load balancing across Prefill and Decode. Key contributions include an adaptive scheduling interval, robust state synchronization, a fine-grained capacity model with water-filling allocation for Prefill, and a dual-objective, IQR-informed Decode scheduling strategy with lexicographical selection. Experiments on DeepSeek-V3 with H800 hardware show Time-to-First-Token (TTFT) reductions of up to 40% and throughput gains of roughly 15–22%, along with substantial improvements in Prefill chunk utilization, demonstrating the practical impact for scalable, high-parameter DP+EP inference systems.

Abstract

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures, specifically the P/D-separated, large-scale DP+EP paradigm, introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving DeepSeek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
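
To make the buffering idea concrete, here is a minimal Python sketch of the two steps the abstract describes: hold requests for a scheduling window, then spread the buffered batch across DP units by load. It is an illustration under stated assumptions, not the paper's implementation. The names `Request`, `StaggeredBatcher`, and `allocate` are hypothetical, the window is a fixed `interval_s` rather than an adaptive one, prompt token count stands in for prefill cost, and a longest-first greedy assignment is swapped in for the paper's allocation strategy.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    prompt_tokens: int  # assumption: token count approximates prefill cost


def allocate(batch, num_dp_units):
    """Assign requests to DP units, largest first, always to the least-loaded unit."""
    heap = [(0, unit) for unit in range(num_dp_units)]  # (current load, unit id)
    heapq.heapify(heap)
    plan = {unit: [] for unit in range(num_dp_units)}
    for req in sorted(batch, key=lambda r: r.prompt_tokens, reverse=True):
        load, unit = heapq.heappop(heap)
        plan[unit].append(req)
        heapq.heappush(heap, (load + req.prompt_tokens, unit))
    return plan


@dataclass
class StaggeredBatcher:
    num_dp_units: int
    interval_s: float  # assumption: fixed buffering window
    last_dispatch: float = 0.0
    buffer: list = field(default_factory=list)

    def submit(self, req):
        """Buffer the request instead of dispatching it immediately."""
        self.buffer.append(req)

    def tick(self, now):
        """Release the buffered batch once the window has elapsed, else return None."""
        if now - self.last_dispatch < self.interval_s or not self.buffer:
            return None
        self.last_dispatch = now
        batch, self.buffer = self.buffer, []
        return allocate(batch, self.num_dp_units)


batcher = StaggeredBatcher(num_dp_units=2, interval_s=0.05)
for rid, tokens in [("a", 900), ("b", 500), ("c", 400), ("d", 100)]:
    batcher.submit(Request(rid, tokens))

early = batcher.tick(now=0.01)  # window still open: nothing is dispatched
plan = batcher.tick(now=0.06)   # window elapsed: the whole batch is allocated
loads = {u: sum(r.prompt_tokens for r in reqs) for u, reqs in plan.items()}
print(early, loads)
```

Because the scheduler sees all four requests at once, it can pair the 900-token request with the 100-token one and put the 500- and 400-token requests together, which a per-request immediate dispatcher could not plan for in advance.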
