Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

Haiquan Wang, Chaoyi Ruan, Jia He, Jiaqi Ruan, Chengjie Tang, Xiaosong Ma, Cheng Li

TL;DR

DHelix is a novel micro-structure, inspired by the DNA structure, that dramatically improves the efficiency of LLM training. It integrates seamlessly with all forms of existing data/model parallelism, including the most challenging one, pipeline parallelism, thanks to a unique model folding design that results in a W-shaped pipeline.

Abstract

The growth of Large Language Models (LLMs) has necessitated large-scale distributed training. Highly optimized frameworks, however, still suffer significant losses in Model FLOPS Utilization (MFU, often below 50%) due to large communication volumes. Meanwhile, our comprehensive profiling shows that the computation- and communication-intensive operators overlap well. This paper introduces DHelix, a novel micro-structure, inspired by the DNA structure, that dramatically improves the efficiency of LLM training. Central to DHelix's design is Strand Interleaving (SI), which views the continuous stream of training micro-batches through a GPU as two strands. DHelix juxtaposes the forward and backward passes of the two strands and performs a systematic optimization for an SI plan that co-schedules the operators from the opposite strands, enabled by operator-level overlap profiling results and a dynamic-programming-based search algorithm. Meanwhile, DHelix enables the two strands to share model states and space for activation data, effectively accommodating two micro-batches with under 3% extra memory space. DHelix seamlessly integrates with all forms of existing data/model parallelism, the most challenging being pipeline parallelism, thanks to its unique model folding design that results in a W-shaped pipeline. We evaluate DHelix training with the popular Llama and GPT dense models, plus the Phi Mixture of Experts (MoE) model, across three GPU clusters (A40, A800, and H100). Results show that it achieves 12-40% (up to 58% MFU) and 2-29% (up to 71% MFU) improvement on the 64-A40 and 64-A800 clusters, respectively, significantly outperforming state-of-the-art methods. On the H100 cluster, though the faster network reduces DHelix's profit margin, it makes cross-node tensor parallelism promising, a practice currently prohibitive due to communication costs.
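
To make the idea of co-scheduling operators from two strands more concrete, the following is a minimal sketch of a dynamic-programming search over two operator sequences. It is not the paper's algorithm: the cost model (`co_run_time`), the single overlap-effectiveness factor `eff`, the operator names, and the durations are all assumptions chosen for illustration. Each operator is a hypothetical `(name, kind, duration)` tuple, and the search decides, in strand order, whether the next forward operator and the next backward operator run alone or run together.

```python
def co_run_time(a, b, eff=1.0):
    """Assumed cost model: a compute op and a communication op overlap;
    two ops of the same kind gain nothing and run back to back.
    eff=1.0 means perfect overlap, eff=0.0 means no overlap at all."""
    (_, kind_a, dur_a), (_, kind_b, dur_b) = a, b
    if kind_a == kind_b:
        return dur_a + dur_b
    return max(dur_a, dur_b) + (1.0 - eff) * min(dur_a, dur_b)


def plan_interleaving(fwd, bwd, eff=1.0):
    """Return (total_time, plan) for two operator sequences.

    best[i][j] is the minimum time needed to finish the first i ops of
    `fwd` and the first j ops of `bwd`, keeping each strand's own order.
    """
    n, m = len(fwd), len(bwd)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    step = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0:  # run the forward op alone
                candidates.append((best[i - 1][j] + fwd[i - 1][2], (i - 1, j)))
            if j > 0:  # run the backward op alone
                candidates.append((best[i][j - 1] + bwd[j - 1][2], (i, j - 1)))
            if i > 0 and j > 0:  # co-execute one op from each strand
                paired = co_run_time(fwd[i - 1], bwd[j - 1], eff)
                candidates.append((best[i - 1][j - 1] + paired, (i - 1, j - 1)))
            if candidates:
                best[i][j], step[i][j] = min(candidates)

    # Walk back from the final state to recover the chosen schedule.
    plan = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = step[i][j]
        fwd_name = fwd[i - 1][0] if pi < i else None
        bwd_name = bwd[j - 1][0] if pj < j else None
        plan.append((fwd_name, bwd_name))
        i, j = pi, pj
    plan.reverse()
    return best[n][m], plan


if __name__ == "__main__":
    # Hypothetical operators and durations (arbitrary time units).
    fwd = [("attn", "comp", 4), ("allgather", "comm", 2)]
    bwd = [("reduce_scatter", "comm", 3), ("mlp_grad", "comp", 5)]
    print(plan_interleaving(fwd, bwd))
```

With these made-up numbers, running every operator alone takes 14 time units, while the search pairs each compute operator with a communication operator from the opposite strand and finishes in 9. A real planner would presumably replace the single `eff` factor with measured per-pair overlap effectiveness, as the abstract's reference to operator-level profiling suggests.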

Paper Structure

This paper contains 28 sections, 2 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Double-strand execution in DHelix on 4 GPUs
  • Figure 2: Sample operator-overlapping schedule produced by the methods proposed in MegaScale [jiang2024megascale], captured using the NVIDIA Nsight profiling tool, in comparison to the execution flow achieved by Megatron-LM (top)
  • Figure 3: Sample execution time breakdown in training different transformer-based models and parameter sizes
  • Figure 4: Overlap effectiveness achieved by overlapping pairs of operators. C1 represents local-node AllGather, while C2 denotes cross-node All-to-All. All operators, except for C2, are derived from Llama 70B. A complete pairwise table, reporting the overlap effectiveness among 14 compute and 10 communication operators, can be found in the repository H100-overlap-eff.
  • Figure 5: Sample memory allocation breakdown when training the Llama-25B model, with a sequence length of 8192 and a micro-batch size of 1, on 64 A40 GPUs with a parallelism strategy of DP=8 and TP=8.
  • ...and 11 more figures