gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

Tianyu Guo, Xianwei Zhang, Jiangsu Du, Zhiguang Chen, Nong Xiao, Yutong Lu

TL;DR

gLLM proposes a globally balanced pipeline parallelism framework with Token Throttling to address pipeline bubbles in distributed LLM serving. By throttling prefill and decode tokens separately, based on real-time system state and KV-cache pressure, it evens out micro-batch workloads and reduces idle GPU time. The asynchronous, multi-process runtime supports decoupled frontend-backend processing and proactive metadata scheduling, enabling higher throughput and lower latency than state-of-the-art pipeline or tensor parallelism systems. Across multiple models and hardware setups, gLLM delivers 11% to 398% higher maximum throughput and better SLO attainment, demonstrating practical impact for low-latency, high-throughput LLM serving at scale.

Abstract

Pipeline parallelism has emerged as a predominant approach for deploying large language models (LLMs) across distributed nodes, owing to its lower communication overhead compared to tensor parallelism. While it delivers high throughput in request serving, pipeline parallelism often suffers from performance limitations caused by pipeline bubbles, which result primarily from imbalanced computation delays across batches. Existing methods like Sarathi-Serve attempt to address this through hybrid scheduling of chunked prefill and decode tokens using a fixed token budget. However, such methods may experience significant fluctuations due to either insufficient prefill tokens or uneven distribution of decode tokens, ultimately leading to computational imbalance. To overcome these inefficiencies, we present gLLM, a globally balanced pipeline parallelism system incorporating Token Throttling to effectively mitigate pipeline bubbles. Our Token Throttling mechanism is a fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, thus enabling balanced computation by leveraging global information from the inference system. Specifically, for decode tokens, gLLM maintains a near-consistent token count across processing batches. For prefill tokens, it dynamically adjusts batch sizes based on both the total pending tokens and the memory utilization rate of the key-value cache (KV cache). Furthermore, the gLLM runtime adopts an asynchronous execution and message passing architecture specifically optimized for the characteristics of pipeline parallelism. Experimental evaluations with representative LLMs show that gLLM achieves significant performance improvements, delivering 11% to 398% higher maximum throughput compared to state-of-the-art pipeline or tensor parallelism systems, while simultaneously maintaining lower latency.
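
The two throttling rules described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's actual policy or formula: the names `SystemState`, `decode_budget`, `prefill_budget`, and the parameters `spread_iters`, `min_tokens`, `max_tokens`, and `kv_threshold` are all hypothetical, and the linear scaling by KV-cache headroom is an assumption chosen only to show how global state could drive separate per-iteration budgets.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Global view a scheduler might consult (all fields are illustrative)."""
    running_decode_tokens: int   # sequences in decode, one token each per step
    pending_prefill_tokens: int  # prompt tokens still waiting for prefill
    kv_cache_utilization: float  # fraction of KV-cache blocks in use, 0.0 to 1.0


def decode_budget(state: SystemState, pipeline_depth: int) -> int:
    """Spread decode tokens evenly over the micro-batches in flight."""
    # Ceiling division, so all decode tokens fit within one pipeline round.
    return -(-state.running_decode_tokens // pipeline_depth)


def prefill_budget(state: SystemState, spread_iters: int, min_tokens: int,
                   max_tokens: int, kv_threshold: float) -> int:
    """Pick a prefill token count from pending work and KV-cache pressure."""
    # Nothing to do, or the cache is too full to admit new prompts.
    if (state.pending_prefill_tokens == 0
            or state.kv_cache_utilization >= kv_threshold):
        return 0
    # Assumption: spread the pending prompt tokens over several iterations
    # instead of draining them in one burst.
    by_pending = -(-state.pending_prefill_tokens // spread_iters)
    # Assumption: shrink the budget linearly as the KV cache fills up.
    headroom = 1.0 - state.kv_cache_utilization / kv_threshold
    by_memory = int(max_tokens * headroom)
    # Clamp to the configured range, then never exceed what is pending.
    budget = max(min_tokens, min(by_pending, by_memory, max_tokens))
    return min(budget, state.pending_prefill_tokens)


if __name__ == "__main__":
    state = SystemState(running_decode_tokens=101,
                        pending_prefill_tokens=8000,
                        kv_cache_utilization=0.75)
    print("decode tokens per micro-batch:", decode_budget(state, 4))
    print("prefill tokens this iteration:",
          prefill_budget(state, 8, 32, 2048, 1.0))
```

Under these assumptions the decode budget depends only on the number of running sequences and the pipeline depth, so it stays close to constant from one micro-batch to the next, while the prefill budget drops as the KV cache fills and falls to zero once the threshold is reached.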

Paper Structure

This paper contains 32 sections, 4 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: A comparison of the scheduled token counts of prefill and decode stage between Sarathi-Serve and an ideal balanced system per iteration (horizontal axis), with the maximum batched token count (token budget) set to 2048 for both systems.
  • Figure 2: Comparison of data parallelism, tensor parallelism and pipeline parallelism.
  • Figure 3: Pipeline bubbles caused by two types of imbalance, i.e., inter-stage and inter-batch (the numbers indicate the ordinal position of each micro-batch).
  • Figure 4: Under-utilized GPU caused by unbalanced scheduling.
  • Figure 5: Comparison between coupled and decoupled scheduling.
  • ...and 11 more figures