HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment

Youhe Jiang, Ran Yan, Binhang Yuan

TL;DR

HexGen-2 tackles scalable, cost-effective LLM inference on heterogeneous GPU pools by disaggregating the prefill and decoding phases. It introduces a constraint-optimization scheduler that combines graph partitioning and max-flow refinement to assign model replicas, parallel strategies, and KV-cache paths. Experiments on OPT-30B and Llama-2-70B show up to 2.0x higher throughput (about 1.3x on average) and 1.5x lower average latency than state-of-the-art systems under the same budget, and comparable performance with a 30% lower budget. The work demonstrates that disaggregated inference with heterogeneity-aware scheduling can significantly reduce deployment costs while preserving or improving service quality.

Abstract

Disaggregating the prefill and decoding phases represents an effective new paradigm for generative inference of large language models (LLMs), which eliminates prefill-decoding interference and optimizes resource allocation. However, how to deploy the disaggregated inference paradigm across a group of heterogeneous GPUs, which can be an economical alternative to deployment over homogeneous high-performance GPUs, remains an open problem. To this end, we introduce HexGen-2, a distributed system for efficient and economical LLM serving on heterogeneous GPUs following the disaggregated paradigm. Built on top of HexGen, the core component of HexGen-2 is a scheduling algorithm that formalizes the allocation of disaggregated LLM inference computations and communications over heterogeneous GPUs and network connections as a constraint optimization problem. We leverage graph partitioning and max-flow algorithms to co-optimize resource allocation, parallel strategies for distinct inference phases, and the efficiency of inter-phase key-value (KV) cache communications. We conduct extensive experiments to evaluate HexGen-2 on OPT (30B) and Llama-2 (70B) models in various real-world settings. The results show that, compared with state-of-the-art systems given the same price budget, HexGen-2 delivers up to a 2.0 times and on average a 1.3 times improvement in serving throughput and reduces the average inference latency by 1.5 times; it also achieves comparable inference performance with a 30% lower price budget.
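To make the max-flow idea concrete, the following is a minimal sketch, not the HexGen-2 scheduler itself. It assumes a toy network in which a source feeds two prefill replicas, each prefill replica can send KV caches to two decoding replicas, and the decoding replicas drain into a sink. Every node name and capacity (read as requests per second) is hypothetical, and the solver is a plain Edmonds-Karp implementation chosen for brevity.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow on a dict-of-dicts capacity graph."""
    # Residual graph: copy forward edges, add zero-capacity reverse edges.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical capacities: replica throughput on source/sink edges,
# KV-cache transfer rate on prefill -> decode edges.
capacity = {
    "source": {"prefill_0": 10, "prefill_1": 6},
    "prefill_0": {"decode_0": 4, "decode_1": 8},
    "prefill_1": {"decode_0": 3, "decode_1": 2},
    "decode_0": {"sink": 7},
    "decode_1": {"sink": 9},
}

flow = max_flow(capacity, "source", "sink")
print(flow)
```

In this toy instance the slow KV-cache links out of `prefill_1` cap the end-to-end flow below the combined prefill capacity, which is the kind of communication bottleneck a heterogeneity-aware scheduler has to account for.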

Paper Structure

This paper contains 25 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Effects of batching on different phases (Llama-2 (7B) inference with an input length of 512 on a single A100 GPU).
  • Figure 2: Illustration of disaggregated paradigm.
  • Figure 3: Illustration of each scheduling step.
  • Figure 4: Communication bandwidth (Gbps) matrix for different settings. The homogeneous setting contains $8\times$H100 GPUs with a budget of 29.5 $\$/h$; heterogeneous setting 1 contains $2\times$H100, $6\times$A100, $4\times$L40, and $8\times$A6000 GPUs with a budget of 28.8 $\$/h$; heterogeneous setting 2 contains $3\times$ each of H100 and A100 and $6\times$ each of L40 and A6000 GPUs with a budget of 26.9 $\$/h$; heterogeneous setting 3 contains $6\times$ each of A100 and A6000 and $12\times$L40 GPUs with a budget of 27.1 $\$/h$; heterogeneous setting 4 contains $3\times$H100 and $9\times$A100 GPUs with a budget of 26.3 $\$/h$; heterogeneous setting 5 contains $4\times$A100, $6\times$L40, and $10\times$A6000 GPUs with a reduced budget of 20.5 $\$/h$ (about $70\%$ of the homogeneous budget).
  • Figure 5: Request traces for online testing.
  • ...and 6 more figures
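
The budgets quoted in the Figure 4 caption can be compared directly. The short sketch below only divides each heterogeneous budget by the homogeneous one; the dictionary keys are labels chosen here for illustration, and the dollar figures are copied from the caption.

```python
# Hourly budgets ($/h) as quoted in the Figure 4 caption.
budgets = {
    "homogeneous": 29.5,
    "heterogeneous 1": 28.8,
    "heterogeneous 2": 26.9,
    "heterogeneous 3": 27.1,
    "heterogeneous 4": 26.3,
    "heterogeneous 5": 20.5,
}

baseline = budgets["homogeneous"]

# Fraction of the homogeneous budget used by each setting.
ratios = {name: cost / baseline for name, cost in budgets.items()}

for name, ratio in ratios.items():
    print(f"{name}: {budgets[name]:.1f} $/h ({ratio:.1%} of homogeneous)")
```

Settings 1 through 4 stay within the homogeneous budget, while setting 5 is the reduced-budget configuration at roughly 70% of it.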