
HexiScale: Accommodating Large Language Model Training over Heterogeneous Environment

Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, Binhang Yuan

TL;DR

HexiScale targets large language model training on heterogeneous GPUs, as an alternative to the usual homogeneous clusters, by supporting asymmetric partitioning across data, pipeline, and tensor model parallelism. It formalizes the allocation problem as a constrained optimization problem and solves it with a two-phase hierarchical graph partitioning algorithm. Empirical results show that HexiScale achieves MFU close to that of homogeneous systems with the same total peak FLOPS on 7B–30B models and outperforms existing heterogeneous systems such as Metis, with scheduling that scales to large clusters. The work suggests a practical path toward cost-effective, flexible LLM training on diverse GPUs.

Abstract

Training large language models (LLMs) is a computationally intensive task that is typically conducted in data centers with homogeneous high-performance GPUs. We explore an alternative approach: deploying training computations across heterogeneous GPUs to enable more flexible and efficient use of heterogeneous resources. To achieve this goal, we propose a novel system, HexiScale, that flexibly supports asymmetric partitioning of training computations across data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm to solve it. Our approach effectively allocates training computations across GPUs, fully leveraging the available computational power. We conduct empirical studies to evaluate HexiScale against state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), HexiScale running over heterogeneous GPUs achieves MFU comparable to that of state-of-the-art training systems running over homogeneous high-performance GPUs with the same total peak FLOPS. The percentage gaps in MFU between HexiScale and comparable homogeneous settings are as low as $0.3\%$, with an average of $3.5\%$.
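To make the MFU comparison concrete, the following is a minimal sketch of how MFU (model FLOPS utilization) is commonly computed for a cluster with mixed GPU types. It assumes the widely used approximation of about 6 FLOPs per parameter per trained token; the function names and all numeric values are illustrative assumptions, not figures from the paper.

```python
from typing import Sequence


def model_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token (assumed ~6 FLOPs per parameter,
    covering forward and backward passes; attention terms are ignored)."""
    return 6.0 * n_params


def mfu(n_params: float, tokens_per_second: float,
        peak_flops_per_gpu: Sequence[float]) -> float:
    """MFU = achieved model FLOPS / total peak FLOPS of all GPUs.

    Summing per-GPU peaks is what lets a heterogeneous cluster be compared
    with a homogeneous one that has the same total peak FLOPS.
    """
    achieved = model_flops_per_token(n_params) * tokens_per_second
    return achieved / sum(peak_flops_per_gpu)


# Hypothetical cluster: 4 faster GPUs and 8 slower GPUs (peak FLOPS are made up).
cluster = [300e12] * 4 + [150e12] * 8
utilization = mfu(n_params=7e9, tokens_per_second=10_000,
                  peak_flops_per_gpu=cluster)
print(f"MFU = {utilization:.3f}")
```

Under this definition, two clusters with equal total peak FLOPS and equal training throughput report the same MFU, regardless of how the peak is distributed across devices.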

Paper Structure (26 sections, 2 equations, 11 figures, 6 tables)


Figures (11)

  • Figure 1: Case study comparing the state-of-the-art training system Megatron with HexiScale. Both systems run their optimal parallel strategies on the same three machines.
  • Figure 2: First phase: the global graph is partitioned into three groups of GPUs in four steps: (i) coarsen, (ii) partition, (iii) project, and (iv) refine. Each of the three groups is then constructed into a pipeline.
  • Figure 3: Second phase: each pipeline is created in three steps. (i) GPUs with high-bandwidth connections are grouped by graph partitioning. (ii) An intra-group strategy is searched separately for each machine, i.e., for the GPUs in the same machine. (iii) The pipeline stage order is determined by permuting the intra-group strategies with a top-$\tau$ greedy search algorithm.
  • Figure 4: End-to-end experiments of HexiScale compared with other systems under various experimental settings with Llama-2 (7B) and Llama-2 (13B) models.
  • Figure 5: End-to-end experiments of HexiScale compared with other systems under various experimental settings with Llama (30B) model.
  • ...and 6 more figures
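
The stage-ordering step in the Figure 3 caption can be read as a beam-style search that keeps only the τ cheapest partial orderings at each step. The sketch below illustrates that reading only: the function `top_tau_order`, the group names, and the link-cost objective are hypothetical, since the paper's actual cost model is not given in this section.

```python
from typing import Callable, Sequence


def top_tau_order(groups: Sequence[str],
                  link_cost: Callable[[str, str], float],
                  tau: int = 2) -> tuple[list[str], float]:
    """Order pipeline stages greedily, keeping the tau best partial orders.

    link_cost(a, b) is an assumed cost of placing stage b right after stage a
    (for example, inverse bandwidth between the two GPU groups).
    """
    # Each beam entry is (accumulated cost, partial stage order).
    beam = [(0.0, [g]) for g in groups]
    for _ in range(len(groups) - 1):
        candidates = []
        for cost, order in beam:
            for g in groups:
                if g in order:
                    continue  # each group appears exactly once in the pipeline
                candidates.append((cost + link_cost(order[-1], g), order + [g]))
        # Stable sort by cost, then keep only the tau cheapest extensions.
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:tau]
    best_cost, best_order = min(beam, key=lambda c: c[0])
    return best_order, best_cost


# Hypothetical symmetric link costs between three GPU groups.
costs = {frozenset("AB"): 1.0, frozenset("BC"): 1.0, frozenset("AC"): 5.0}
order, cost = top_tau_order(["A", "B", "C"],
                            lambda a, b: costs[frozenset((a, b))],
                            tau=2)
print(order, cost)
```

A larger `tau` explores more orderings at higher search cost; `tau=1` reduces to a plain greedy choice, and a `tau` large enough to keep every candidate is an exhaustive search over permutations.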