H2: Towards Efficient Large-Scale LLM Training on Hyper-Heterogeneous Cluster over 1,000 Chips

Ding Tang, Jiecheng Zhou, Jiakai Hu, Shengwei Li, Huihuang Zheng, Zhilin Pei, Hui Wang, Xingcheng Zhang

TL;DR

This work tackles the challenge of training very large language models on hyper-heterogeneous clusters with more than 1,000 chips from multiple vendors. It introduces H2, a unified framework comprising DiTorch for cross-chip operator consistency, DiComm for device-direct RDMA communication, and HeteroPP with HeteroAuto for adaptive, topology-aware pipeline parallelism. Using a cost-model-driven, two-stage automatic search and topology-aware optimizations, the approach demonstrates throughput gains and stable training of a 100B-parameter model across over 1,000 heterogeneous devices. The results indicate that heterogeneity-aware design can outperform traditional homogeneous training while reducing idle time and improving resource utilization, offering a practical path to scalable, cost-efficient large-scale training.

Abstract

Recent advancements in large language models (LLMs) necessitate extensive computational resources, prompting the use of diverse hardware accelerators from multiple vendors. However, traditional distributed training frameworks struggle to efficiently utilize hyper-heterogeneous clusters comprising thousands of chips due to significant disparities in software stacks, operator implementations, communication libraries, and hardware capabilities. To address these challenges, we propose H2 (short for HyperHetero), a systematic framework that enables efficient training of LLMs on clusters with over 1,000 heterogeneous chips. H2 incorporates DiTorch, a unified PyTorch-compatible interface that ensures program consistency across chips, and DiComm, a device-direct RDMA communication library optimized for heterogeneous environments. Furthermore, we introduce HeteroPP with HeteroAuto, an adaptive pipeline parallelism strategy that dynamically balances computational load, memory limitations, and communication overhead. Evaluations on a 100-billion-parameter LLM demonstrate that our approach consistently achieves superlinear speedup, outperforming baseline homogeneous training solutions by up to 16.37% in our experiments. These findings validate the feasibility and efficiency of hyper-heterogeneous training at unprecedented scales.

Paper Structure

This paper contains 31 sections, 5 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Comparison of chip specifications between capability-incremental and our hyper-heterogeneous scenario. In traditional heterogeneous scenarios, as indicated by the black dashed circles in the figure, chips show a trend of increasing capabilities in computation, communication, and memory. In contrast, in hyper-heterogeneous scenarios, as indicated by the red dashed circles in the figure, the capabilities of chips in these three aspects do not follow any specific pattern.
  • Figure 2: LLM Pre-training Software Stack. Red parts represent our work.
  • Figure 3: Intra-Node Bandwidth Performance in Different GPU Servers
  • Figure 4: System Overview of DiTorch and DiComm.
  • Figure 5: DiTorch's precision alignment across Chips A, B, C, and D compared to the NVIDIA A100.
  • ...and 7 more figures