HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You

TL;DR

HeteGen is a principled framework for heterogeneous parallel computing across CPUs and GPUs. It combines heterogeneous parallelism with asynchronous overlap to mitigate I/O bottlenecks and achieve low-latency LLM inference on resource-constrained devices.

Abstract

The emergence of Large Language Models (LLMs) has led to ever larger model sizes, which makes inference on low-resource devices challenging. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from inefficiency caused by I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
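To make the idea of heterogeneous parallelism concrete, the following minimal sketch splits one linear layer by output feature so that two workers can each compute their own slice independently. It is an illustration under stated assumptions, not the authors' implementation: the function names `linear` and `split_linear` are hypothetical, `alpha` is used here simply as the fraction of output features given to the first worker (the excerpt does not define it), and both slices run sequentially in plain Python rather than on a real CPU and GPU.

```python
def linear(x, weight, bias):
    """Return y with y[j] = sum_i weight[j][i] * x[i] + bias[j]."""
    return [
        sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
        for row, b_j in zip(weight, bias)
    ]


def split_linear(x, weight, bias, alpha):
    """Compute the same layer as two independent output slices.

    alpha (assumed meaning): fraction of output features handled by the
    first worker, which stands in for the CPU. The remainder stands in
    for the GPU. A real system would run the two slices concurrently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    k = int(alpha * len(weight))
    # Each worker needs only its own rows of the weight and bias.
    y_first = linear(x, weight[:k], bias[:k])
    y_second = linear(x, weight[k:], bias[k:])
    # Concatenating the partial outputs reproduces the full output.
    return y_first + y_second


x = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
bias = [0.5, 0.5, 0.5, 0.5]
print(split_linear(x, weight, bias, alpha=0.5) == linear(x, weight, bias))
```

Because each output feature depends only on its own weight row, the split changes where the work happens but not the result, which is what makes the partition ratio a free tuning knob.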

Paper Structure

This paper contains 25 sections, 11 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Memory space and processing speed for GPU, CPU, and I/O between CPU and GPU. Speed is measured on an OPT-30B MLP linear layer with an NVIDIA A10 GPU and an Intel Xeon CPU @ 2.30 GHz, and is calculated as parameter size divided by processing time.
  • Figure 2: Memory usage in OPT-30B. Batch size is 1, and sequence length is 512.
  • Figure 3: A straightforward illustration of heterogeneous parallelism. COM denotes parameter communication, $P_i$ denotes the $i$-th part of the model, and the black line represents data exchange.
  • Figure 4: Overview of HeteGen. HeteGen has two main stages: scheduling and runtime. In the scheduling stage, it uses the alpha benchmark to distribute computation and decides on parameter policies based on our scheduler. In the runtime stage, it optimizes I/O and CPU utilization within heterogeneous modules using hybrid parallelism and manages asynchronous weights to minimize system impact.
  • Figure 5: Demonstration of different heterogeneous parallelism strategies.
  • ...and 3 more figures
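The speed metric described in the Figure 1 caption (parameter size divided by processing time) can be sketched as follows. This is only an illustration: the helper names `throughput_gb_per_s` and `best_time` are hypothetical, the choice of 1 GB = 1e9 bytes is an assumption, and the numbers in the usage lines are made up rather than taken from the paper.

```python
import time


def throughput_gb_per_s(num_params, bytes_per_param, seconds):
    """Parameter size divided by processing time, in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_params * bytes_per_param / seconds / 1e9


def best_time(fn, repeats=5):
    """Return the shortest wall-clock duration of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Made-up example: 4 million 2-byte parameters processed in 0.5 seconds.
print(throughput_gb_per_s(4_000_000, 2, 0.5))

# Timing a stand-in workload; a real benchmark would time the layer itself.
elapsed = best_time(lambda: sum(range(100_000)))
print(elapsed > 0)
```

The same function applies to all three quantities in the figure (GPU compute, CPU compute, and CPU-to-GPU transfer), since each is expressed as bytes of parameters handled per second.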