FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Jiaao He; Jidong Zhai

TL;DR

FastDecode reduces the cost of serving autoregressive LLMs by decomposing the transformer into a memory-bound R-Part, which handles KV-cache accesses, and a compute-bound S-Part, which runs on GPUs. It moves the R-Part to CPUs distributed across nodes outside the GPU chassis, so that only small intermediate vectors are exchanged, and it adds a sequence-level load-stabilizing schedule plus model-guided hardware selection to keep the heterogeneous hardware balanced. The system achieves 1.88x to 5.04x the throughput of vLLM on the same GPU across modern models, scales the CPU work across multiple nodes, and includes mixed-precision and quantization optimizations. The approach offers a practical, cost-effective path to high-throughput LLM serving by exploiting aggregate CPU memory bandwidth and cross-node parallelism.

Abstract

The cost of serving large language models (LLMs) is high, yet expensive and scarce GPUs run inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by constantly reused intermediate results, namely the KV-Cache, which occupies too much memory to fit more sequences into a GPU simultaneously. While the KV-Cache could be offloaded to host memory, the CPU-GPU bandwidth is an inevitable bottleneck. We find a way to decompose transformer models into two parts with different characteristics, one of which includes the memory-bound KV-Cache accesses. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes is an efficient option for processing this part. The performance improvement comes from reduced data transmission overhead and boosted GPU throughput for the other model part. Moreover, we address the efficiency challenges brought by heterogeneity at both temporal and inter-device scopes using scheduling and performance modeling techniques. Evaluation results show that our system achieves 1.88x to 5.04x the throughput of vLLM when serving modern LLMs with the same GPU.
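The bandwidth argument in the abstract can be made concrete with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes standard multi-head attention (one K and one V vector of the hidden size per token per layer), 2-byte fp16 elements, and that running attention next to the cache means sending the new token's Q, K, and V vectors and receiving one output vector per layer. The model dimensions (hidden size 4096, 32 layers) and all function names are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes_per_token(hidden_size: int, num_layers: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV-cache that one token adds across all layers.

    Assumption: one K and one V vector of `hidden_size` elements per layer.
    """
    return 2 * hidden_size * num_layers * bytes_per_elem


def offload_traffic_bytes(seq_len: int, hidden_size: int, num_layers: int,
                          bytes_per_elem: int = 2) -> int:
    """Bytes moved per decoding step if one sequence's whole cache is
    fetched from host memory to the GPU."""
    return seq_len * kv_cache_bytes_per_token(hidden_size, num_layers,
                                              bytes_per_elem)


def remote_attention_traffic_bytes(hidden_size: int, num_layers: int,
                                   bytes_per_elem: int = 2) -> int:
    """Bytes exchanged per decoding step if attention runs beside the cache.

    Assumption: Q, K, V of the new token go out and one output vector
    comes back, i.e. four vectors per layer, independent of sequence length.
    """
    return 4 * hidden_size * num_layers * bytes_per_elem


if __name__ == "__main__":
    hidden, layers, seq_len = 4096, 32, 1024  # hypothetical model and context
    fetch = offload_traffic_bytes(seq_len, hidden, layers)
    remote = remote_attention_traffic_bytes(hidden, layers)
    print(f"fetch whole cache : {fetch / 2**20:.0f} MiB per step")
    print(f"exchange vectors  : {remote / 2**20:.0f} MiB per step")
    print(f"ratio             : {fetch // remote}x")
```

Under these assumptions the ratio works out to half the sequence length, so the saving grows with longer contexts. Real models with grouped-query attention or quantized caches would shift the numbers.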

Paper Structure (31 sections, 11 equations, 15 figures, 3 tables, 1 algorithm)

This paper contains 31 sections, 11 equations, 15 figures, 3 tables, 1 algorithm.

Introduction
Background and Motivation
Transformer Model and KV-Cache
Accelerating Decoding
Memory-bound Workload Fits CPU
Observation and Insights
Performance Dilemma and Decomposition
CPUs can Undertake More in LLM
Methodology
System Overview
Sequence-level Load-Stabilizing Schedule
Workload-balanced Hardware Selection
Implementation
Mix-precision CPU Attention
Supporting Quantization
...and 16 more sections

Figures (15)

Figure 1: Memory footprint of KV-cache stops increasing GPU utilization by enlarging batch size
Figure 2: Performance characteristics of typical GPUs and CPUs, matching the need of two parts of the model
Figure 3: Performance dilemma in auto-regressive generation
Figure 4: Workers of FastDecode
Figure 5: Temporal view of FastDecode
...and 10 more figures
