On Optimizing the Communication of Model Parallelism
Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
TL;DR
The paper tackles cross-mesh resharding, a key communication challenge when intra-op and inter-op model parallelism are combined for large-scale DL. It formalizes the problem as many-to-many multicast between non-overlapping device meshes and proposes broadcast-based resharding plus an overlap-friendly pipeline schedule (eager-1F1B) to hide communication latency. The authors implement these strategies in the AlpaComm library and demonstrate up to 10× improvements on microbenchmarks, along with end-to-end throughput gains of about 10% on a GPT-3-like model and about 50% (1.5×) on U-Transformer, approaching the upper bound given by a hypothetical all-to-all Send/Recv. The work provides a practical, generalizable optimization framework for cross-mesh data movement and layout conversion, enabling more scalable and efficient large-model training across heterogeneous cluster topologies.
Abstract
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism, intra-operator and inter-operator parallelism, are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or a different layout. We formalize this as a many-to-many multicast communication problem, and show that existing approaches are either sub-optimal or do not generalize to the different network topologies and tensor layouts that result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10× across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
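To make the many-to-many multicast view concrete, here is a minimal Python sketch of how a resharding plan might be derived for a 1-D tensor. It is an illustration only, not the AlpaComm API: the function names (`shard_ranges`, `multicast_plan`), the equal-size contiguous sharding, and the device numbering scheme are all assumptions introduced for this example. Each entry of the plan maps one slice held by a source device to the list of destination devices that need it, which is the grouping a broadcast-based approach would operate on.

```python
from collections import defaultdict


def shard_ranges(length, num_shards):
    """Split [0, length) into num_shards equal contiguous ranges."""
    assert length % num_shards == 0, "sketch assumes evenly divisible shards"
    size = length // num_shards
    return [(i * size, (i + 1) * size) for i in range(num_shards)]


def multicast_plan(length, src_shards, dst_shards, dst_replicas=1):
    """Map each (source device, slice) to the destination devices needing it.

    Assumed (hypothetical) layout: the source mesh holds one shard per device;
    the destination mesh holds dst_shards shards, each replicated on
    dst_replicas devices, numbered shard-major.
    """
    src = shard_ranges(length, src_shards)
    dst = shard_ranges(length, dst_shards)
    plan = defaultdict(list)
    for d_shard, (d_lo, d_hi) in enumerate(dst):
        for s_dev, (s_lo, s_hi) in enumerate(src):
            # Overlap between what the source holds and what the destination needs.
            lo, hi = max(d_lo, s_lo), min(d_hi, s_hi)
            if lo < hi:
                for r in range(dst_replicas):
                    plan[(s_dev, (lo, hi))].append(d_shard * dst_replicas + r)
    return dict(plan)


# 8 elements: 4 source shards -> 2 destination shards, each replicated twice.
plan = multicast_plan(length=8, src_shards=4, dst_shards=2, dst_replicas=2)
for (src_dev, span), receivers in sorted(plan.items()):
    print(f"src {src_dev} sends elements {span} to dst devices {receivers}")
```

In this toy layout every source slice has two receivers, so a naive scheme would send each slice twice, whereas a multicast (broadcast) to the receiver group sends it once per group. Real layouts are multi-dimensional and the paper's system additionally accounts for network topology, which this sketch ignores.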
