HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Xuanlei Zhao; Bin Jia; Haotian Zhou; Ziming Liu; Shenggan Cheng; Yang You

TL;DR

HeteGen is a principled framework for heterogeneous parallel computing across CPUs and GPUs. It combines heterogeneous parallelism with asynchronous overlap to mitigate I/O bottlenecks and achieve low-latency LLM inference on resource-constrained devices.

Abstract

The emergence of Large Language Models (LLMs) has led to ever-larger model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference, but they often suffer from poor efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Building on this framework, HeteGen applies heterogeneous parallelism and asynchronous overlap to LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
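The idea of splitting one layer's work between CPU and GPU can be illustrated with a minimal sketch. It is not HeteGen's implementation: the cost model, the function names (`balanced_alpha`, `hetero_linear`), and the use of Python threads as stand-ins for the two devices are all assumptions made for illustration. The sketch keeps a fraction `alpha` of a linear layer's weight rows on the "CPU" path and sends the rest through a simulated transfer to the "GPU" path, so that both paths run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def balanced_alpha(cpu_speed, io_speed, gpu_speed):
    """Fraction of weights to keep on the CPU so both paths finish together.

    Assumed cost model (not taken from the paper): the CPU path costs
    alpha * S / cpu_speed, and the GPU path costs
    (1 - alpha) * S * (1 / io_speed + 1 / gpu_speed), i.e. a transfer
    followed by compute. Speeds are in bytes of parameters per second.
    """
    cpu_cost = 1.0 / cpu_speed                   # seconds per byte, CPU path
    gpu_cost = 1.0 / io_speed + 1.0 / gpu_speed  # seconds per byte, transfer + GPU
    return gpu_cost / (cpu_cost + gpu_cost)


def matvec(rows, x):
    """Plain matrix-vector product over a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def hetero_linear(weight, x, alpha):
    """Compute weight @ x with the rows split between two concurrent workers."""
    split = round(alpha * len(weight))
    cpu_rows, gpu_rows = weight[:split], weight[split:]

    def gpu_path():
        # Copying the rows stands in for the CPU-to-GPU parameter transfer.
        transferred = [list(row) for row in gpu_rows]
        return matvec(transferred, x)

    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_future = pool.submit(matvec, cpu_rows, x)
        gpu_future = pool.submit(gpu_path)
        # Row order is preserved: CPU rows first, then the transferred rows.
        return cpu_future.result() + gpu_future.result()


if __name__ == "__main__":
    alpha = balanced_alpha(cpu_speed=1.0, io_speed=1.0, gpu_speed=1.0)
    weight = [[1, 0], [0, 1], [1, 1]]
    print(alpha, hetero_linear(weight, [2, 3], alpha))
```

Under this assumed model, a slower I/O link raises `alpha`, so more of the layer stays on the CPU. Python threads only illustrate the structure here; real overlap would rely on asynchronous device transfers and kernels.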

Paper Structure (25 sections, 11 equations, 8 figures, 3 tables)

Introduction
Background
Generative Language Models Inference
Memory Analysis
Offloading Bottleneck
Heterogeneous Parallelism
Parallelism Strategy
Computation Distribution
HeteGen
Overview
Hybrid Heterogeneous Parallelism
Asynchronous Parameter Manager
Alpha Benchmark
Heterogeneous Module Scheduler
Experiments
...and 10 more sections

Figures (8)

Figure 1: Memory space and processing speed for GPU, CPU, and I/O between CPU and GPU. The speed is tested with OPT-30B MLP Linear on NVIDIA A10 GPU and Intel Xeon @ 2.30GHz CPU, calculated as parameter size divided by processing time.
Figure 2: Memory usage in OPT-30B. Batch size is 1, and sequence length is 512.
Figure 3: A straightforward illustration of heterogeneous parallelism. COM denotes parameter communication, $P_i$ refers to the i-th part of the model and the black line represents data exchange.
Figure 4: Overview of HeteGen. HeteGen has two main stages: scheduling and runtime. In the scheduling stage, it uses the alpha benchmark to distribute computation and decides on parameter policies based on our scheduler. In the runtime stage, it optimizes I/O and CPU utilization within heterogeneous modules using hybrid parallelism and manages asynchronous weights to minimize system impact.
Figure 5: Demonstration of different heterogeneous parallelism strategies.
...and 3 more figures
