Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari
TL;DR
This work tackles the latency-throughput trade-off in large language model inference under dynamic, mixed workloads. It introduces Shift Parallelism, a dual-configuration strategy that switches between Ulysses Sequence Parallelism (SP) and Tensor Parallelism (TP). Because the KV cache is invariant across the two configurations, the transitions are seamless. SP is adapted to inference with GQA support and KV-cache replication, and the method is implemented as a vLLM plug-in. Evaluations on real-world traces and synthetic benchmarks show a better latency-throughput trade-off than TP or DP alone. The findings suggest that Shift Parallelism delivers low latency for interactive scenarios and high throughput for batch workloads, with practical benefits for production deployment and open-source availability.
Abstract
Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, its GPU communication reduces combined token throughput. Data parallelism (DP), on the other hand, obtains higher throughput but has slow response latency. Neither offers the best of both worlds, and it is not possible to combine TP and DP because the KV cache layout varies across the two parallelisms. We notice that Sequence Parallelism (SP; Ulysses in training) has properties similar to DP but with KV cache invariance. We adapt SP to inference and combine it with TP to get the best of both worlds. Our solution is Shift Parallelism. Shift Parallelism dynamically switches between TP and SP, minimizing latency in low traffic without losing throughput in high traffic. The efficient GPU communication of Shift Parallelism yields up to i) 1.51x faster response in interactive workloads and ii) 50% higher throughput in batch workloads, compared to a TP-only solution. We evaluate Shift Parallelism on real-world production traces with dynamic traffic patterns, as well as on synthetic benchmarking patterns across models, context sizes, and arrival rates. All results affirm the same conclusion: Shift Parallelism has a better latency vs. throughput tradeoff than TP or DP, and hence obtains low latency without degrading throughput in dynamic workloads.
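The KV cache invariance argument can be illustrated with a toy model. The sketch below is not the authors' implementation. It assumes that a cache entry is identified by a (request, KV head) pair, that both TP and Ulysses-style SP shard attention by head, and that DP places whole requests on replicas round-robin. The names `kv_layout` and `choose_mode` and the token `threshold` are hypothetical, introduced only for illustration, and the sketch ignores the GQA and KV-cache replication cases mentioned above by requiring the head count to divide evenly across GPUs.

```python
from typing import Dict, List, Set, Tuple

# Toy model (an assumption): one KV cache entry per (request_id, kv_head_index).
KVEntry = Tuple[int, int]


def kv_layout(
    mode: str, requests: List[int], num_kv_heads: int, num_gpus: int
) -> Dict[int, Set[KVEntry]]:
    """Return which KV cache entries each GPU would hold under a given mode."""
    if mode in ("TP", "SP"):
        # Assumed: attention is head-sharded in both TP and Ulysses-style SP,
        # so every GPU stores the same slice of heads for every request.
        if num_kv_heads % num_gpus != 0:
            raise ValueError("this sketch needs num_kv_heads divisible by num_gpus")
        per_gpu = num_kv_heads // num_gpus
        return {
            gpu: {
                (r, h)
                for r in requests
                for h in range(gpu * per_gpu, (gpu + 1) * per_gpu)
            }
            for gpu in range(num_gpus)
        }
    if mode == "DP":
        # Assumed: each replica stores all heads, but only for its own requests.
        return {
            gpu: {
                (r, h)
                for i, r in enumerate(requests)
                if i % num_gpus == gpu
                for h in range(num_kv_heads)
            }
            for gpu in range(num_gpus)
        }
    raise ValueError(f"unknown mode: {mode}")


def choose_mode(num_batched_tokens: int, threshold: int) -> str:
    """Hypothetical policy: TP for small batches (latency), SP for large ones."""
    return "TP" if num_batched_tokens <= threshold else "SP"


if __name__ == "__main__":
    requests = [0, 1, 2, 3]
    tp = kv_layout("TP", requests, num_kv_heads=8, num_gpus=4)
    sp = kv_layout("SP", requests, num_kv_heads=8, num_gpus=4)
    dp = kv_layout("DP", requests, num_kv_heads=8, num_gpus=4)
    print("TP and SP layouts match:", tp == sp)
    print("TP and DP layouts match:", tp == dp)
    print(choose_mode(64, threshold=256), choose_mode(4096, threshold=256))
```

In this toy model the TP and SP layouts coincide, so a switch between them needs no cache movement, whereas a switch to DP would require redistributing entries. The threshold-based policy is only one plausible way to decide when to switch; the abstract does not specify the actual criterion.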
