HexGen: Generative Inference of Large Language Model over Heterogeneous Environment

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan

TL;DR

The paper tackles the high cost of large language model generative inference by serving it in heterogeneous, cross-datacenter settings. It introduces HexGen, a distributed inference engine that supports asymmetric tensor model parallelism and pipeline parallelism across diverse GPUs, together with a constrained-optimization scheduler that combines dynamic programming and genetic search to optimize layout, communication, and memory. In experiments serving Llama-2 (70B), HexGen meets latency deadlines up to 2.3x lower, or handles request rates up to 4x higher, than a homogeneous baseline on the same budget, and it outperforms Petals by up to 10x in throughput under half-budget scenarios. The approach offers a path toward more economical and scalable deployment of foundation models across heterogeneous infrastructure, with an open-source release planned.

Abstract

Serving generative inference of large language models is a crucial component of contemporary AI applications. This paper focuses on deploying such services in a heterogeneous and cross-datacenter setting to mitigate the substantial inference costs typically associated with a single centralized datacenter. Towards this end, we propose HexGen, a flexible distributed inference engine that uniquely supports the asymmetric partition of generative inference computations over both tensor model parallelism and pipeline parallelism and allows for effective deployment across diverse GPUs interconnected by a fully heterogeneous network. We further propose a sophisticated scheduling algorithm grounded in constrained optimization that can adaptively assign asymmetric inference computation across the GPUs to fulfill inference requests while maintaining acceptable latency levels. We conduct an extensive evaluation to verify the efficiency of HexGen by serving the state-of-the-art Llama-2 (70B) model. The results suggest that HexGen can choose to achieve up to 2.3 times lower latency deadlines or tolerate request rates up to 4 times higher compared with the homogeneous baseline given the same budget.
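
To make "asymmetric partition" concrete, the following is a minimal sketch, not HexGen's actual data structures or scheduler. It assumes a layout is a list of pipeline stages, each with its own tensor-parallel degree and layer count, and checks only that the stages cover the model and that each GPU's weight shard fits in memory. The names `Stage` and `is_valid_layout`, the GPU labels, and all memory figures are hypothetical; KV cache, activations, and communication cost are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage: a group of identical GPUs sharding the same layers."""
    gpu_type: str          # label only (hypothetical)
    tp_degree: int         # GPUs that split each layer via tensor parallelism
    num_layers: int        # transformer layers assigned to this stage
    mem_per_gpu_gb: float  # memory available on each GPU of this stage


def is_valid_layout(stages, total_layers, layer_mem_gb):
    """Return True if the stages cover every layer and each weight shard fits."""
    if sum(s.num_layers for s in stages) != total_layers:
        return False
    for s in stages:
        # Weights of a stage are split evenly across its tensor-parallel group.
        shard_gb = s.num_layers * layer_mem_gb / s.tp_degree
        if shard_gb > s.mem_per_gpu_gb:
            return False
    return True


# Illustrative numbers: 80 layers at roughly 1.75 GB of fp16 weights per layer.
TOTAL_LAYERS = 80
LAYER_MEM_GB = 1.75

# Asymmetric: stages differ in both tensor-parallel degree and layer count.
asymmetric = [
    Stage("small-gpu", tp_degree=4, num_layers=48, mem_per_gpu_gb=24.0),
    Stage("large-gpu", tp_degree=2, num_layers=32, mem_per_gpu_gb=48.0),
]

# Same layer split, but too few small GPUs in the first stage to hold its shard.
too_few_gpus = [
    Stage("small-gpu", tp_degree=2, num_layers=48, mem_per_gpu_gb=24.0),
    Stage("large-gpu", tp_degree=2, num_layers=32, mem_per_gpu_gb=48.0),
]

print(is_valid_layout(asymmetric, TOTAL_LAYERS, LAYER_MEM_GB))
print(is_valid_layout(too_few_gpus, TOTAL_LAYERS, LAYER_MEM_GB))
```

A scheduler of the kind the abstract describes would search over many such layouts and score them against latency and request-rate targets; this sketch covers only the feasibility check that a search like that would need.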

Paper Structure (22 sections, 11 equations, 7 figures, 4 tables, 1 algorithm)

This paper contains 22 sections, 11 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Case study of parallel strategy over heterogeneity.
  • Figure 2: SLO attainment results to evaluate cost-performance trade-offs. Each row corresponds to a particular output sequence length (32, 64, 128). The first four columns correspond to different SLO scales ranging from 8 to 0.125 requests per second. The last column represents the performance comparison of various settings at different request rates.
  • Figure 3: HexGen and Petals. Two rows correspond to output sequence lengths of $32$ and $64$. First two columns illustrate the results of different SLO scales. The last column shows the effects of request rate on SLO attainment.
  • Figure 4: SLO attainment results of HexGen compared with HexGen with 4 GPUs offline.
  • Figure 5: HexGen vs. Huggingface-TGI. Two rows represent output sequence lengths of $32$ and $64$. First two columns show the results of different SLO scales. The last column shows the effects of request rate on SLO attainment.
  • ...and 2 more figures