FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Jiaao He, Jidong Zhai

TL;DR

FastDecode addresses the high cost of serving autoregressive LLMs by decomposing the transformer into two parts: a memory-bound R-Part that handles the KV-cache and a compute-bound S-Part that runs on GPUs. It moves the R-Part to distributed, out-of-chassis CPUs, so that only small intermediate vectors are exchanged, and it introduces a sequence-level load-stabilizing schedule plus model-guided hardware selection to balance the heterogeneous hardware. The system achieves 1.88x to 5.04x the throughput of vLLM on the same GPU across modern models, with scalable multi-node CPU involvement and mixed-precision/quantization optimizations. The approach offers a practical, cost-effective path to high-throughput LLM serving by exploiting CPU memory bandwidth and cross-node parallelism.
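
The split described above can be illustrated with a toy, single-head decoding step. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the R-Part is the attention over the cached keys and values (no model weights) and the S-Part is the weight-bearing projections. The names `RWorker`, `s_part_step`, and `matvec` are hypothetical.

```python
import math


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class RWorker:
    """Hypothetical CPU-side worker: owns the KV-cache, holds no model weights."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, q, k, v):
        # Append the new token's key/value, then attend over the whole cache.
        self.keys.append(k)
        self.values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * val[i] for w, val in zip(weights, self.values)) / total
            for i in range(len(v))
        ]


def s_part_step(x, wq, wk, wv, wo, r_worker):
    """Hypothetical GPU-side step: weight-bearing projections only."""
    q, k, v = matvec(wq, x), matvec(wk, x), matvec(wv, x)
    # Only q, k, v (outbound) and the attention result (inbound) cross the
    # device boundary; the growing KV-cache never leaves the R-Part worker.
    attended = r_worker.attend(q, k, v)
    return matvec(wo, attended)
```

In this sketch the traffic per token is a few vectors of the hidden size, while the data read inside `attend` grows with the sequence length, which is the asymmetry the summary refers to.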

Abstract

The cost of serving large language models (LLMs) is high, yet expensive and scarce GPUs are used inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. The batch size, however, is limited by constantly reused intermediate results, namely the KV-Cache, which occupies too much memory for more sequences to fit into a GPU simultaneously. While the KV-Cache could be offloaded to host memory, the CPU-GPU bandwidth then becomes an unavoidable bottleneck. We find a way to decompose transformer models into two parts with different characteristics, one of which includes the memory-bound KV-Cache accesses. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes make them an efficient option for processing this part. The performance improvement comes from reduced data transmission overhead and higher GPU throughput on the other part of the model. Moreover, we address the efficiency challenges brought by heterogeneity at both temporal and inter-device scopes using scheduling and performance modeling techniques. Evaluation results show that our system achieves 1.88x to 5.04x the throughput of vLLM when serving modern LLMs with the same GPU.
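
To make the batch-size limit concrete, the following back-of-the-envelope sketch estimates how many sequences fit in a given amount of free GPU memory. All numbers are illustrative assumptions rather than figures from the paper: a model with 32 layers and hidden size 4096, standard multi-head attention storing one key and one value vector per layer, 2-byte (fp16) elements, 1024-token sequences, and 10 GiB left over after loading the weights. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, hidden_size, bytes_per_elem):
    """KV-cache bytes for one token: one key and one value vector per layer."""
    return 2 * n_layers * hidden_size * bytes_per_elem


def max_batch_size(free_bytes, seq_len, n_layers, hidden_size, bytes_per_elem):
    """Number of full-length sequences whose KV-cache fits in free_bytes."""
    per_sequence = seq_len * kv_bytes_per_token(n_layers, hidden_size, bytes_per_elem)
    return free_bytes // per_sequence


# Assumed configuration (illustrative only).
per_token = kv_bytes_per_token(n_layers=32, hidden_size=4096, bytes_per_elem=2)
batch = max_batch_size(10 * GIB, seq_len=1024, n_layers=32,
                       hidden_size=4096, bytes_per_elem=2)
print(per_token, batch)  # 524288 bytes (0.5 MiB) per token, 20 sequences
```

Under these assumptions each sequence needs 0.5 GiB of cache, so the batch is capped at 20 sequences regardless of how much compute the GPU still has to spare.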

Paper Structure

This paper contains 31 sections, 11 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: Memory footprint of KV-cache stops increasing GPU utilization by enlarging batch size
  • Figure 2: Performance characteristics of typical GPUs and CPUs, matching the need of two parts of the model
  • Figure 3: Performance dilemma in auto-regressive generation
  • Figure 4: Workers of FastDecode
  • Figure 5: Temporal view of FastDecode
  • ...and 10 more figures