HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang

TL;DR

The paper addresses the challenge of training very large models on heterogeneous GPU clusters by introducing HETHUB, a distributed training system with hybrid parallelism. It combines three components: an Infinity Collective Communication Library (ICCL) that unifies communication between devices of different types, a distributed performance predictor that estimates how a given parallel strategy will perform, and an automatic parallel planner that searches for efficient training configurations. The approach is validated on Llama2-140B, where it reaches up to 97.49% of the theoretical upper bound, and the paper also reports throughput, MFU, and end-to-end results across heterogeneous clusters. In practice, the system lets large-scale models be trained on mixed hardware with efficient resource utilization and reduced development complexity.

Abstract

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs. Building a large-scale cluster from a single type of GPU-accelerator is a challenge. Using multiple types of GPU-accelerators to construct a large-scale cluster is an effective way to solve the problem of insufficient homogeneous GPU-accelerators. However, existing distributed training systems for large-scale models support only homogeneous GPU-accelerators and do not support heterogeneous ones. To address this problem, this paper proposes HETHUB, a distributed training system with hybrid parallelism for large-scale models, which supports heterogeneous clusters that include AMD GPUs, Nvidia GPUs, and other types of GPU-accelerators. It introduces a distributed unified communicator to realize communication between heterogeneous GPU-accelerators, together with a distributed performance predictor and an automatic parallel planner, to develop and train models efficiently on heterogeneous GPU-accelerators. Compared to distributed training systems with homogeneous GPU-accelerators, our system can support six combinations of heterogeneous GPU-accelerators. We train the Llama-140B model on a heterogeneous cluster with 768 GPU-accelerators (128 AMD and 640 GPU-accelerator A). The experimental results show that the optimal performance of our system in the heterogeneous cluster reaches up to 97.49% of the theoretical upper bound performance.
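To make the idea of an automatic parallel planner more concrete, the following is a minimal sketch of a hybrid-parallel strategy search over the 768 accelerators mentioned above. It is not HETHUB's planner or predictor. The function names, the cost constants, the `max_tp` and `min_model_parallel` limits, and the micro-batch count are all hypothetical, and the cost model ignores device heterogeneity entirely. The only standard ingredient is the pipeline bubble fraction (pp - 1) / (m + pp - 1) for m micro-batches.

```python
# Toy strategy search: enumerate (tensor, pipeline, data) parallel degrees
# whose product equals the device count, then rank them with a made-up
# cost model. All names and constants are illustrative assumptions.

NUM_DEVICES = 128 + 640  # 128 AMD + 640 "GPU-accelerator A" = 768


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_strategies(num_devices, max_tp=8, min_model_parallel=16):
    """List every (tp, pp, dp) with tp * pp * dp == num_devices.

    max_tp is a hypothetical cap (tensor parallelism kept inside one node);
    min_model_parallel is a hypothetical stand-in for a memory constraint.
    """
    strategies = []
    for tp in divisors(num_devices):
        if tp > max_tp:
            continue
        for pp in divisors(num_devices // tp):
            dp = num_devices // (tp * pp)
            if tp * pp >= min_model_parallel:
                strategies.append((tp, pp, dp))
    return strategies


def toy_predicted_step_time(tp, pp, dp, num_microbatches=64):
    """Hypothetical cost model; a real predictor would use profiled data."""
    ideal_compute = 1.0 / (tp * pp * dp)              # perfectly divided work
    bubble = (pp - 1) / (num_microbatches + pp - 1)   # pipeline bubble fraction
    tp_comm = 0.0002 * (tp - 1)                       # made-up tensor-parallel cost
    dp_comm = 0.0001 * (dp - 1) / dp                  # made-up gradient all-reduce cost
    return ideal_compute / (1.0 - bubble) + tp_comm + dp_comm


strategies = enumerate_strategies(NUM_DEVICES)
best = min(strategies, key=lambda s: toy_predicted_step_time(*s))
print("candidates:", len(strategies), "best (tp, pp, dp):", best)
```

A planner for a heterogeneous cluster would additionally have to decide how layers and batches are split unevenly across device types, which this sketch does not attempt.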

Paper Structure (20 sections, 3 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: The scheme of HETHUB system
  • Figure 2: CPU-based communicator with Ethernet or IPoIB
  • Figure 3: GPU-based communicator with IB
  • Figure 4: The workflow of distributed performance predictor
  • Figure 5: The workflow of automatic parallel planner
  • ...and 3 more figures