Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

TL;DR

This work tackles the latency–throughput trade-off in large language model inference under dynamic, mixed workloads. It introduces Shift Parallelism, a dual-configuration strategy that switches between Ulysses Sequence Parallelism (SP) and Tensor Parallelism (TP) while preserving KV cache invariance, enabling seamless transitions. The approach is extended to inference, with GQA support and KV-cache replication, and is implemented as a vLLM plug-in, demonstrated across real-world traces and synthetic benchmarks to yield lower latency and higher throughput than TP or DP alone. The findings suggest Shift Parallelism delivers low latency for interactive scenarios and high throughput for batch workloads, with practical production deployment benefits and open-source availability.

Abstract

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, its GPU communication reduces combined token throughput. Data parallelism (DP), on the other hand, obtains higher throughput but has slower response latency. Neither offers the best of both worlds, and TP and DP cannot be combined because of the KV cache variance across the two parallelisms. We notice that Sequence Parallelism (SP, known as Ulysses in training) has properties similar to DP but with KV cache invariance. We adapt SP to inference and combine it with TP to get the best of both worlds. Our solution is Shift Parallelism, which dynamically switches between TP and SP, minimizing latency in low traffic without losing throughput in high traffic. The efficient GPU communication of Shift Parallelism yields up to i) 1.51x faster response in interactive workloads and ii) 50% higher throughput in batch workloads, compared to a TP-only solution. We evaluate Shift Parallelism on real-world production traces with dynamic traffic patterns as well as synthetic benchmarking patterns across models, context sizes, and arrival rates. All results affirm the same: Shift Parallelism has a better latency vs. throughput tradeoff than TP or DP, and hence obtains low latency without degrading throughput in dynamic workloads.
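
To make the idea of switching between TP and SP concrete, here is a minimal Python sketch of a mode selector. The abstract only states that the switch is dynamic and that low traffic should favor latency while high traffic should favor throughput. The decision criterion used here (number of tokens in the current batch), the class name `ShiftPolicy`, and the threshold value are all illustrative assumptions, not details taken from the paper or its vLLM plug-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftPolicy:
    """Hypothetical per-batch selector between TP and SP."""

    # Assumed cut-off in batched tokens; an illustrative value, not from the paper.
    token_threshold: int = 256

    def choose(self, num_batched_tokens: int) -> str:
        """Return "TP" for small batches (low traffic) and "SP" for large ones."""
        if num_batched_tokens < 0:
            raise ValueError("num_batched_tokens must be non-negative")
        # Small batch: favor TP for low latency.
        # Large batch: favor SP for its DP-like throughput.
        return "TP" if num_batched_tokens <= self.token_threshold else "SP"


policy = ShiftPolicy(token_threshold=256)
modes = [policy.choose(n) for n in (1, 64, 4096)]
print(modes)
```

Because SP and TP share the same KV cache layout, a selector of this kind could in principle change mode from one batch to the next without moving cached keys and values.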

Paper Structure

This paper contains 53 sections, 1 equation, 17 figures, 5 tables, 2 algorithms.

Figures (17)

  • Figure 1: Comparison of response speed (#input tok./TTFT), generation rate (1/TPOT), and throughput (tokens/sec). Shift Parallelism obtains a higher throughput than TP in high traffic, and lower latency than TP and DP in low traffic.
  • Figure 2: Bursty workload.
  • Figure 3: Parallelization of the vanilla transformer on two GPUs with TP and SP. The attention has four heads and is parallelized across heads. In (b), SP (1) partitions the input sequence, (2) switches to head parallelism using an all-to-all communication and applies head parallelization to attention, and (3) returns to SP.
  • Figure 4: Vanilla transformer architecture and the attention mechanism.
  • Figure 5: Although SP and TP are essentially different parallelisms, Shift Parallelism exploits the KV cache invariance between SP and TP for swiftly switching between them.
  • ...and 12 more figures
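
The three SP steps in the Figure 3 caption, and the KV cache invariance named in the Figure 5 caption, can be illustrated with a small NumPy simulation. This is a sketch under stated assumptions: the "GPUs" are entries in a Python list, the all-to-all is emulated with `np.split` and `np.concatenate` rather than real collective communication, attention is plain multi-head causal attention (no GQA or KV-cache replication), and all function names are hypothetical.

```python
import numpy as np


def all_to_all_seq_to_head(shards):
    """Emulated all-to-all: sequence-sharded -> head-sharded.

    shards[i] has shape (seq / P, heads, dim) on simulated GPU i.
    """
    p = len(shards)
    # Each GPU cuts its local tensor into P head groups, one per destination.
    send = [np.split(s, p, axis=1) for s in shards]
    # GPU j gathers head group j from every GPU, in sequence order.
    return [np.concatenate([send[i][j] for i in range(p)], axis=0) for j in range(p)]


def all_to_all_head_to_seq(shards):
    """Emulated all-to-all: head-sharded -> sequence-sharded (the inverse)."""
    p = len(shards)
    send = [np.split(s, p, axis=0) for s in shards]
    return [np.concatenate([send[i][j] for i in range(p)], axis=1) for j in range(p)]


def attention(q, k, v):
    """Causal softmax attention per head; inputs have shape (seq, heads, dim)."""
    seq, _, dim = q.shape
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(dim)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)


rng = np.random.default_rng(0)
P, SEQ, HEADS, DIM = 2, 4, 4, 3  # two GPUs and four heads, as in Figure 3
q, k, v = (rng.standard_normal((SEQ, HEADS, DIM)) for _ in range(3))

# (1) SP partitions the input sequence across GPUs.
q_sp, k_sp, v_sp = (np.split(x, P, axis=0) for x in (q, k, v))
# (2) All-to-all switches to head parallelism; attention runs per head group.
q_hp, k_hp, v_hp = (all_to_all_seq_to_head(x) for x in (q_sp, k_sp, v_sp))
out_hp = [attention(q_hp[g], k_hp[g], v_hp[g]) for g in range(P)]
# (3) A second all-to-all returns the output to the sequence-sharded layout.
out_sp = all_to_all_head_to_seq(out_hp)

# Under TP, each GPU would hold the K/V of its own heads for all tokens.
k_tp, v_tp = np.split(k, P, axis=1), np.split(v, P, axis=1)
kv_invariant = all(
    np.array_equal(k_hp[g], k_tp[g]) and np.array_equal(v_hp[g], v_tp[g])
    for g in range(P)
)
matches_reference = np.allclose(np.concatenate(out_sp, axis=0), attention(q, k, v))
print(kv_invariant, matches_reference)
```

In this toy setting, the keys and values each simulated GPU holds inside attention under SP are identical to what it would hold under head-parallel TP, which is the property that would let the two modes share one KV cache.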