Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

Guanhua Wang, Chengming Zhang, Zheyu Shen, Ang Li, Olatunji Ruwase

TL;DR

Domino tackles the communication bottleneck in tensor-parallel LLM training by introducing row-wise input splitting, column-wise weight splitting, and a hybrid split to enable fine-grained overlap of computation and NCCL collectives. By partitioning data dependencies into independent compute units and pipelining them, Domino hides communication behind computation across both single-node and multi-node TP setups, while remaining compatible with kernel fusion and graph-based acceleration. Empirical results on GPT-3 and Llama-2 models on DGX-H100 demonstrate up to 1.3x throughput improvements over Megatron-LM, with robust gains across model sizes and node counts, and near-optimal performance in several cases. The approach is open-sourced as part of Microsoft DeepSpeed, offering a practical, scalable path to more efficient large-scale LLM training.

Abstract

Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch's training into smaller independent pieces, Domino pipelines the training of these pieces and provides a generic strategy for fine-grained overlapping of communication and computation. Extensive results show that, compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
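
The slicing idea can be illustrated with a toy example. The sketch below is not Domino's implementation: it uses plain Python lists instead of GPU tensors, involves no NCCL calls or overlap scheduling, and the helper names (`matmul`, `split_rows`, `split_cols`) are hypothetical. It only checks that a row-wise split of the input and a column-wise split of the weight each yield pieces that can be computed independently and then reassembled into the unsplit result, which is the property that would let one piece's communication overlap with the next piece's computation.

```python
# Illustrative sketch only: pure-Python matrices, no GPUs or NCCL involved.
# All helper names here are hypothetical and not part of Domino or DeepSpeed.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_rows(x, parts):
    """Row-wise (batch-dimension) split of the input into equal chunks."""
    step = len(x) // parts
    return [x[i * step:(i + 1) * step] for i in range(parts)]

def split_cols(w, parts):
    """Column-wise split of the weight into equal chunks."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

X = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 samples, hidden size 2
W = [[1, 0, 2, 1], [0, 1, 1, 3]]       # hidden size 2 -> 4 outputs

full = matmul(X, W)                    # reference: unsplit computation

# Row-wise input split: each chunk's output depends only on its own rows.
row_chunks = [matmul(x_i, W) for x_i in split_rows(X, 2)]
row_result = [row for chunk in row_chunks for row in chunk]

# Column-wise weight split: each weight slice yields a block of output columns.
col_chunks = [matmul(X, w_j) for w_j in split_cols(W, 2)]
col_result = [sum((chunk[r] for chunk in col_chunks), []) for r in range(len(X))]

assert row_result == full
assert col_result == full
```

In a real tensor-parallel setting, each independent piece would be followed by its own collective, so the collective for one piece could run while the next piece is still computing; the toy code above shows only the independence, not the scheduling.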

Paper Structure (33 sections, 6 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: GPT-3-13B computation and communication ratio per training iteration over 1 DGX-H100 node (8 H100), 2 nodes (16 H100) and 4 nodes (32 H100) using TP.
  • Figure 2: The four AllReduce operations in each transformer block in TP training. The two blank AllReduce boxes are in the forward pass, and the two grey AllReduce boxes are in the backward pass.
  • Figure 3: TP computation and communication ratio per training iteration on varied model types and model sizes over 1 to 4 DGX-H100 nodes (8 to 32 H100 GPUs).
  • Figure 4: Forward pass of single Self-Attention / MLP layer.
  • Figure 5: Domino row-wise (batch-dim) split on input $X$.
  • ...and 8 more figures