On Optimizing the Communication of Model Parallelism

Yonghao Zhuang; Hexu Zhao; Lianmin Zheng; Zhuohan Li; Eric P. Xing; Qirong Ho; Joseph E. Gonzalez; Ion Stoica; Hao Zhang

TL;DR

The paper tackles cross-mesh resharding, a key communication challenge when intra-op and inter-op model parallelism are combined for large-scale DL. It formalizes the problem as many-to-many multicast between non-overlapping device meshes and proposes broadcast-based resharding plus an overlap-friendly pipeline schedule (eager-1F1B) to hide communication latency. The authors implement these strategies in the AlpaComm library and demonstrate up to 10× microbenchmark improvements and up to 1.5× end-to-end throughput gains on GPT-3-like and U-Transformer models, approaching a hypothetical all-to-all Send/Recv upper bound. The work provides a practical, generalizable optimization framework for cross-mesh data movement and layout conversion, enabling more scalable and efficient large-model training across heterogeneous cluster topologies.

Abstract

We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
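To make the many-to-many multicast view concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation or the AlpaComm API: the helper names (`layout`, `overlap`, `resharding_tasks`), the sharding-spec notation, and the two example specs are all assumptions chosen for illustration. It takes a 4 × 4 tensor on two 2 × 2 meshes, computes which region each device holds under a source and a destination spec, and then lists, for each distinct source tile, the devices that could send it and the devices that need part of it.

```python
from itertools import product


def layout(shape, mesh_shape, spec):
    """Map each mesh coordinate to the tensor region it holds.

    spec[d] is the mesh axis that shards tensor dim d, or None if dim d is
    replicated. This notation is hypothetical, made up for this sketch.
    """
    regions = {}
    for coord in product(*(range(n) for n in mesh_shape)):
        region = []
        for dim, axis in enumerate(spec):
            if axis is None:
                region.append((0, shape[dim]))  # whole dim on every device
            else:
                size = shape[dim] // mesh_shape[axis]
                region.append((coord[axis] * size, (coord[axis] + 1) * size))
        regions[coord] = tuple(region)
    return regions


def overlap(a, b):
    """Intersection of two regions, or None if they do not intersect."""
    out = []
    for (a0, a1), (b0, b1) in zip(a, b):
        lo, hi = max(a0, b0), min(a1, b1)
        if lo >= hi:
            return None
        out.append((lo, hi))
    return tuple(out)


def resharding_tasks(src, dst):
    """One task per distinct source tile: possible senders and receivers."""
    senders = {}
    for dev, region in src.items():
        # Source devices holding the same region are replicas of one tile.
        senders.setdefault(region, []).append(dev)
    tasks = []
    for tile, holders in sorted(senders.items()):
        receivers = {}
        for dev, region in sorted(dst.items()):
            part = overlap(tile, region)
            if part is not None:
                receivers[dev] = part  # slice of the tile this device needs
        tasks.append({"tile": tile, "senders": holders, "receivers": receivers})
    return tasks


# Illustrative specs (not those of Figure 2): the source mesh splits rows
# along mesh axis 0, the destination mesh splits columns along mesh axis 1.
# Coordinates are local to each mesh; the two meshes share no devices.
src = layout((4, 4), (2, 2), (0, None))
dst = layout((4, 4), (2, 2), (None, 1))
tasks = resharding_tasks(src, dst)
for task in tasks:
    print(task["tile"], "from", task["senders"], "to", sorted(task["receivers"]))
```

In this example each source tile has two replicas that could act as sender and four destination devices that need a slice of it, which is the one-to-many structure that a broadcast-style strategy can exploit and that a load-balancing step has to schedule across senders.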

Paper Structure (31 sections, 2 equations, 11 figures, 4 tables)

Introduction
Background
Model Parallelism
Cross-mesh Resharding
Optimizations for a single cross-mesh resharding task
Optimizing individual unit communication task
Load balance and schedule for multiple unit communication tasks
Overlapping-friendly Pipeline Schedule
Evaluation
Microbenchmark
Single device to multiple devices
Multiple devices to multiple devices
End-to-End Performance
Ablation Study
Load balance
...and 16 more sections

Figures (11)

Figure 1: Model parallelism applied to an MLP, where each node represents an operator and its output tensor, and solid arrows indicate the direction of data flow. The node with dotted boundaries in each of (b)-(d) shows the required input layout to matmul2. When combining the two parallelisms in (d), cross-mesh resharding emerges for both exchanging tensors and converting their layouts between two meshes.
Figure 2: Two examples of cross-mesh resharding. Given a $4 \times 4$ matrix and two $2 \times 2$ meshes $Mesh_A$ and $Mesh_B$, each dotted box shows a sharding spec and its resulting tensor layout on each device of the mesh (note that $Mesh_A$ is used twice: in the first and third dotted boxes). Two cross-mesh resharding tasks are formed when communicating between Specs 1 and 2 and between Specs 2 and 3.
Figure 3: Communication strategies for an individual communication task. The orange arrows represent cross-host communication, and the numbers on them represent the latency of the communication. The blue arrows represent intra-node communication.
Figure 4: Timelines of the 1F1B and eager-1F1B schedules. Each row is a pipeline stage. Numbers indicate micro-batch indices. Micro-batch 4's communication from stage 2 to stage 3 is magnified.
Figure 5: Single device to multiple devices microbenchmark result.
...and 6 more figures
