Seesaw: High-throughput LLM Inference via Model Re-sharding

Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko

TL;DR

Seesaw introduces dynamic model re-sharding to tailor parallelism for prefilling and decoding in throughput-focused LLM inference. It couples this with tiered KV cache buffering and transition-minimizing scheduling to reduce re-sharding overhead, supported by an asynchronous scheduler–worker design and CPU-GPU KV-cache sharing. Evaluations show average throughput gains of $1.36\times$ and up to $1.78\times$ over vLLM across PCIe and NVLink configurations, demonstrating robust improvements across model sizes and workloads. The approach enables higher batching, better resource utilization, and scalable offline inference for large LLMs.

Abstract

To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the two stages of LLM inference, prefilling and decoding, have distinct computational characteristics, so a single static parallelization strategy cannot effectively optimize both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that reconfigures the parallelization strategy between stages, thereby maximizing throughput in both. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work together to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78x (1.36x on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.

Paper Structure

This paper contains 60 sections, 6 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Breakdown of execution time for the prefill and decode stages for LLaMA2-13B inference on 8 L4 GPUs (The global batch size is 16. Pipeline parallelism further divides the data into micro-batches of size $16/\text{PP}$ to fully utilize pipelining).
  • Figure 2: Different scheduling policies considering transition overhead. Decoding throughput is positively correlated with the number of sequences in GPU memory (the maximal batch size), which is highlighted as the light green area.
  • Figure 3: Different effects of tensor and pipeline parallelism on prefilling and decoding. Tensor parallelism incurs all-reduce overhead, which accounts for a higher share of execution time in prefilling, so pipeline parallelism is better for prefilling. Conversely, pipeline parallelism splits batches into smaller micro-batches, which leads to more forward passes and repeated weight loading, and this is inefficient in decoding.
  • Figure 4: An example showing that spatially disaggregating prefilling and decoding has a restricted search space. Deploying a 70B model on eight 40GiB GPUs allows only one disaggregation strategy: four GPUs for prefilling and four for decoding. However, this causes a severe throughput mismatch between the two stages.
  • Figure 5: Model weights and KV cache need to be re-sharded when switching between different parallelism strategies.
  • ...and 10 more figures
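
The constraint described in the Figure 4 caption can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes 16-bit weights (2 bytes per parameter), which the caption does not state, and it ignores KV cache and activation memory. The helper names `min_gpus_for_weights` and `disaggregation_splits` are hypothetical and are not taken from Seesaw.

```python
import math

GIB = 1024 ** 3  # bytes per GiB


def min_gpus_for_weights(num_params: float, bytes_per_param: int, gpu_mem_gib: float) -> int:
    """Smallest GPU count whose combined memory can hold the weights alone.

    Assumption: KV cache and activations are ignored, so this is a lower bound.
    """
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / (gpu_mem_gib * GIB))


def disaggregation_splits(total_gpus: int, min_gpus: int) -> list[tuple[int, int]]:
    """All (prefill_gpus, decode_gpus) splits in which each side fits a full model copy."""
    return [(p, total_gpus - p) for p in range(min_gpus, total_gpus - min_gpus + 1)]


# 70B parameters at an assumed 2 bytes each is about 130.4 GiB of weights,
# which does not fit in three 40 GiB GPUs (120 GiB) but does fit in four.
min_gpus = min_gpus_for_weights(70e9, 2, 40)
splits = disaggregation_splits(8, min_gpus)
print(min_gpus, splits)  # prints: 4 [(4, 4)]
```

Under these assumptions each stage needs at least four GPUs just to hold the weights, so an eight-GPU machine admits only the 4 + 4 split, which matches the single disaggregation strategy named in the caption.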