NanoFlow: Towards Optimal Large Language Model Serving Throughput

Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

TL;DR

The paper analyzes end-to-end LLM serving throughput and shows that, although some components are memory-bound, typical workloads are compute-bound once the full inference pipeline is considered. It introduces NanoFlow, a framework that exploits intra-device parallelism through nano-batching and a two-stage auto-search to overlap heterogeneous operations, supported by a runtime for scheduling and KV-cache management. Empirical results on LLaMA-2-70B and other models show a 1.91x throughput improvement over state-of-the-art baselines and ~68.5% of the theoretically optimal throughput, with broad applicability to other architectures. The work offers a practical path toward optimal serving throughput for planet-scale LLM deployments and a reusable auto-search methodology for diverse models and hardware.

Abstract

Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems' performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute-bound for most common workloads and LLMs. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving--compute, memory, networking--are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow's end-to-end serving throughput on several popular models, including LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 50% to 72% of optimal throughput across popular models.
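The nano-batching idea in the abstract can be illustrated with a toy model. The sketch below is not NanoFlow's implementation: the function names, the equal-size split, and the two-stage cost model (one compute-bound stage followed by one memory-bound stage per nano-batch) are assumptions made for illustration, and the model ignores the kernel interference that NanoFlow's auto-search explicitly accounts for.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_nano_batches(batch: Sequence[T], num_nano_batches: int) -> List[List[T]]:
    """Split a batch into contiguous, near-equal nano-batches.

    Equal sizing is an assumption; NanoFlow chooses sizes via auto-search.
    """
    if num_nano_batches <= 0:
        raise ValueError("num_nano_batches must be positive")
    base, extra = divmod(len(batch), num_nano_batches)
    nano_batches, start = [], 0
    for i in range(num_nano_batches):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first few
        nano_batches.append(list(batch[start:start + size]))
        start += size
    return nano_batches


def sequential_time(stages: Sequence[Tuple[float, float]]) -> float:
    """Total time when each (compute, memory) stage pair runs back to back."""
    return sum(compute + memory for compute, memory in stages)


def overlapped_time(stages: Sequence[Tuple[float, float]]) -> float:
    """Makespan when the memory stage of one nano-batch may run while the
    compute stage of the next one runs (idealized, no interference)."""
    compute_done = 0.0
    memory_done = 0.0
    for compute, memory in stages:
        compute_done += compute  # compute stages are issued back to back
        # a memory stage starts once its own compute stage and the previous
        # memory stage have both finished
        memory_done = max(memory_done, compute_done) + memory
    return memory_done


if __name__ == "__main__":
    print(split_into_nano_batches(list(range(10)), 3))
    whole_batch = [(4.0, 2.0)]                 # one batch: no overlap possible
    two_nano = [(2.0, 1.0), (2.0, 1.0)]        # same work split in two
    print(sequential_time(whole_batch), overlapped_time(two_nano))
```

In this idealized model, splitting the same work into two nano-batches hides part of the memory-bound stage behind compute. In practice, concurrently running kernels slow each other down, which is why the abstract stresses that interference must be considered when choosing the number, size, and ordering of nano-batches.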

Paper Structure

This paper contains 32 sections, 6 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Transformer architecture. The operations in the yellow boxes have large batch sizes and share model weight parameters across requests; hence, they are compute-bound. Operations in green boxes require loading a unique KV cache for each request; hence, they are memory-bound. The blue box represents network operations that perform synchronization between operations.
  • Figure 2: Comparison of network time and compute time. The closer to yellow, the more compute-bound the workload is, whereas the closer to blue indicates the workload is more network-bound.
  • Figure 3: Comparison of compute time and memory time. The closer to yellow, the more compute-bound the workload is, whereas the closer to green the more memory-bound it becomes.
  • Figure 4: Execution pipeline of existing systems. The green, yellow, and blue operations correspond to memory-, compute-, and network-bound operations, respectively. Operations in the previous and next layer are denoted by dotted borders. "WASTED" marks the pipeline stages where the most constrained resource, compute, is underutilized. Small operations (e.g., layernorm and activation) are omitted for simplicity.
  • Figure 5: Interference characteristics between GEMM and GEMV kernels. The points on the x-axis correspond to unique GEMM-GEMV implementation pairs. The y-axis denotes the GEMM and GEMV kernels' normalized performance $P$.
  • ...and 6 more figures
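
The compute-bound versus memory-bound distinction drawn in the Figure 1 caption can be made concrete with a rough roofline-style estimate. The sketch below is an illustration under stated assumptions, not the paper's cost model: the function names are hypothetical, the hardware numbers (300 TFLOP/s peak compute, 2 TB/s memory bandwidth) are placeholders rather than the specification of any real device, elements are assumed to be 16-bit, and decode attention is modeled for a single head without grouped-query attention.

```python
def gemm_intensity(batch_tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a dense layer applied to `batch_tokens` tokens.

    The weight matrix is loaded once and shared by every token in the batch.
    """
    flops = 2 * batch_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + batch_tokens * d_in + batch_tokens * d_out)
    return flops / bytes_moved


def decode_attention_intensity(context_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one decode step of one attention head.

    Each request must load its own K and V cache, so batching does not help.
    """
    flops = 4 * context_len * head_dim                        # QK^T plus PV
    bytes_moved = 2 * context_len * head_dim * bytes_per_elem  # K and V
    return flops / bytes_moved


def bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify an operation against the hardware's ridge point."""
    return "compute-bound" if intensity >= peak_flops / mem_bandwidth else "memory-bound"


if __name__ == "__main__":
    PEAK_FLOPS = 300e12      # hypothetical accelerator, FLOP/s
    MEM_BANDWIDTH = 2e12     # hypothetical accelerator, bytes/s
    for name, value in [
        ("GEMM, 2048 tokens", gemm_intensity(2048, 8192, 8192)),
        ("GEMM, 1 token", gemm_intensity(1, 8192, 8192)),
        ("decode attention", decode_attention_intensity(1024, 128)),
    ]:
        print(f"{name}: {value:.2f} FLOP/byte, {bound(value, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions, the dense layer becomes compute-bound only when many tokens share the same weights, while decode attention stays near one FLOP per byte regardless of batch size, which matches the yellow and green grouping described in the caption.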