Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, Dou Shen

TL;DR

The paper tackles scheduling inefficiencies in large-scale DP+EP LLM inference, where immediate dispatch causes device-side queuing and head-of-line (HOL) blocking. It introduces Staggered Batch Scheduling (SBS), which buffers incoming requests to form near-optimal execution batches and gives the scheduler a global view for load balancing across Prefill and Decode. Key contributions include an adaptive scheduling interval, robust state synchronization, a fine-grained capacity model with water-filling allocation for Prefill, and a dual-objective, IQR-informed Decode scheduling strategy with lexicographical selection. Experiments on DeepSeek-V3 with H800 hardware show TTFT reductions of up to 40% and throughput gains of around 15–22%, along with substantial improvements in Prefill chunk utilization, demonstrating practical impact for scalable, high-parameter DP+EP inference systems.

Abstract

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures (specifically the P/D-separated, large-scale DP+EP paradigm) introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational load across DP units for both Prefill and Decode phases. Deployed on a production H800 cluster serving DeepSeek-V3, our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.

Paper Structure

This paper contains 23 sections, 4 equations, 8 figures, 1 table, 3 algorithms.

Figures (8)

  • Figure 1: Evolution of Scheduling Granularity.
  • Figure 2: Impact of Dispatch Strategy on Queuing Dynamics.
  • Figure 3: Synchronization Overhead under Immediate Dispatch. Due to the strict synchronization barrier in DP+EP architectures, the system throughput is bottlenecked by the slowest DP unit (Straggler). Greedy assignment leads to load imbalance, resulting in significant Parallelization Bubbles (marked as "Waste") where faster DPs idle-wait for stragglers.
  • Figure 4: Mitigation of Straggler Effect via Batched Bin-Packing. By buffering requests to form a batch, the Staggered Batch Scheduler gains a global view to apply "Water-Filling" allocation (Algorithm \ref{alg:pbaa}). This ensures uniform workload distribution across DP units, filling the bubbles seen in Figure \ref{fig:load_balance_1} and maximizing effective compute utilization.
  • Figure 5: System Architecture of the Staggered Batch Scheduler (SBS). The system is centered around a Main Schedule Loop that governs request dispatching. (1) Inference Instances, each consisting of multiple Data Parallel (DP) units, execute forward passes. Upon completion of a pass, they asynchronously send an EndForward Signal containing payload statistics (remaining token count and execution time) back to the scheduler. (2) The Global State & Feedback System acts as the source of truth, maintaining the Global State Matrix ($\langle C_{avail}, B_i, K_i \rangle$) updated by instance feedback, and dynamically calculating the optimal interval ($I_{opt}$) via Algorithm \ref{alg:adaptive_interval}. (3) The Schedule Loop waits for a dual trigger condition: the elapse of the calculated interval $I_{opt}$, AND the receipt of an EndForward notification from the next target instance. Once triggered, the scheduler batches pending requests and dispatches them to all DPs of the selected instance via the Policy Engine (Algorithms \ref{alg:pbaa} and \ref{alg:iqr_schedule}), initiating the next cycle.
  • ...and 3 more figures
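
The straggler effect described in the captions for Figures 3 and 4 can be illustrated with a small sketch. It is not the paper's water-filling algorithm: it uses a generic largest-first, least-loaded heuristic as a stand-in, treats prompt token count as a proxy for execution time, and all function names and request sizes are hypothetical. Under a synchronization barrier, the step time equals the load of the slowest DP unit, so the "bubble" is the total time the other units spend waiting.

```python
from typing import List, Sequence


def assign_least_loaded(tokens: Sequence[int], num_dp: int) -> List[int]:
    """Place each request, in the given order, on the least-loaded DP unit."""
    loads = [0] * num_dp
    for t in tokens:
        target = loads.index(min(loads))  # ties go to the lowest index
        loads[target] += t
    return loads


def immediate_dispatch(tokens: Sequence[int], num_dp: int) -> List[int]:
    """Hypothetical baseline: requests are placed one by one in arrival order."""
    return assign_least_loaded(tokens, num_dp)


def batched_dispatch(tokens: Sequence[int], num_dp: int) -> List[int]:
    """Stand-in for batched allocation: buffer the batch, place largest first."""
    return assign_least_loaded(sorted(tokens, reverse=True), num_dp)


def bubble(loads: Sequence[int]) -> int:
    """Idle work under a barrier: every unit waits for the slowest one."""
    slowest = max(loads)
    return sum(slowest - load for load in loads)


# Illustrative request sizes in tokens (made up, not from the paper).
arrivals = [1000, 1000, 2000, 4000]

immediate_loads = immediate_dispatch(arrivals, num_dp=2)
batched_loads = batched_dispatch(arrivals, num_dp=2)

print("immediate:", immediate_loads, "bubble:", bubble(immediate_loads))
print("batched:  ", batched_loads, "bubble:", bubble(batched_loads))
```

In this toy case, arrival-order placement leaves one unit with 3000 tokens and the other with 5000, while seeing the whole batch first allows an even 4000/4000 split. The sketch only shows why a global view of the batch helps; the paper's capacity model and allocation rules are more detailed.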