Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Yunze Wei; Tianshuo Hu; Cong Liang; Yong Cui

TL;DR

The paper addresses the communication bottleneck in distributed training of large DNNs by framing the problem as a three-layer paradigm: Parallelization Strategy, Collective Communication Library, and Network. It surveys representative advances in each layer and finds that the layers are currently optimized largely independently of one another. It therefore advocates "Vertical" and "Horizontal" co-designs, which extend the three-layer paradigm to five layers, together with "Intra-Inter" and "Host-Net" co-designs that exploit heterogeneous resources. The contribution lies in synthesizing current approaches, highlighting open issues, and outlining research directions aimed at reducing job completion time and improving scalability in large-scale training. The framework is intended to guide the design of distributed training systems that make better use of diverse hardware and network resources.

Abstract

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, making communication a larger portion of the overall training time. Consequently, optimizing communication for distributed training has become crucial. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that layers in the current three-layer paradigm are relatively independent and there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we advocate "Vertical" and "Horizontal" co-designs which extend the three-layer paradigm to a five-layer paradigm. We also advocate "Intra-Inter" and "Host-Net" co-designs to further utilize the potential of heterogeneous resources. We hope this article can shed some light on future research on communication optimization for distributed training.
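The abstract's observation that faster GPUs make communication a larger share of training time can be illustrated with a few lines of arithmetic. The sketch below is only an illustration: the function name `comm_fraction`, the timings, and the assumption that computation and communication do not overlap are all hypothetical and do not come from the paper.

```python
def comm_fraction(compute_s: float, comm_s: float, compute_speedup: float = 1.0) -> float:
    """Share of one training iteration spent communicating.

    Assumes (hypothetically) that computation and communication run
    back to back with no overlap, and that only computation speeds up.
    """
    faster_compute_s = compute_s / compute_speedup
    return comm_s / (faster_compute_s + comm_s)


# Hypothetical iteration: 8 s of computation, 2 s of communication.
baseline = comm_fraction(8.0, 2.0)                        # 2 / (8 + 2)
faster_gpu = comm_fraction(8.0, 2.0, compute_speedup=4)   # 2 / (2 + 2)
print(f"baseline: {baseline:.0%}, with 4x faster compute: {faster_gpu:.0%}")
```

Under these made-up numbers, communication time is unchanged in absolute terms, yet its share of the iteration grows once computation becomes four times faster.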

Paper Structure (17 sections, 5 figures, 1 table)

This paper contains 17 sections, 5 figures, 1 table.

Introduction
Architecture
Overview
Common Parallelization Strategies
Common Collective Communication Primitives
Underlying Network
Three-Layer Paradigm of Communication Optimization
Overview of Recent Advances
Parallelization Strategy
Collective Communication Library
Network
Collaborative Design in Current Advances
Opportunities
Communication-Efficient Five-Layer Paradigm
Collaborative Design for Heterogeneous Resources
...and 2 more sections

Figures (5)

Figure 1: Distributed training architecture from a communication perspective.
Figure 2: Four common parallelization strategies: a) Data parallelism; b) Pipeline parallelism; c) Tensor parallelism; d) MoE parallelism.
Figure 3: Four common collective communication primitives: a) Broadcast; b) All-Gather; c) All-to-All; d) All-Reduce. The left side of each panel shows the data state before communication, and the right side shows the state after communication.
Figure 4: TACCL's workflow. The synthesizer takes as input a communication sketch, profiled topology, and target collective along with synthesizer hyperparameters to generate an algorithm for the collective. The synthesized algorithm is implemented in the hardware cluster using TACCL's backend [shah2023taccl].
Figure 5: Collaborative design opportunities in distributed training. (a) Communication-efficient five-layer paradigm. (b) Case study of resource competition. The topology is part of a fat-tree with three tiers of switches: Top-of-Rack (ToR), Aggregation (Agg), and Core. Each host has multiple GPUs; only the left host is shown in detail. Multiple flows from different training jobs compete for network resources. The switch marked with a chip logo in its upper right corner is a programmable switch with in-network aggregation capability.
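The before and after data states described in the Figure 3 caption can be sketched in plain Python. This is a single-process simulation for illustration only, not how any collective communication library implements these primitives. The function names, the `Buffers` type, and the convention that `buffers[r]` holds the data of rank `r` are assumptions made for this sketch.

```python
from typing import List

Buffers = List[List[int]]  # buffers[r] is the data held by rank r (hypothetical layout)


def broadcast(buffers: Buffers, root: int = 0) -> Buffers:
    """Every rank ends up with a copy of the root rank's data."""
    return [list(buffers[root]) for _ in buffers]


def all_gather(buffers: Buffers) -> Buffers:
    """Every rank ends up with the concatenation of all ranks' data, in rank order."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


def all_to_all(buffers: Buffers) -> Buffers:
    """Rank src sends its dst-th element to rank dst (a transpose across ranks).

    Assumes each rank holds exactly one element per rank.
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


def all_reduce(buffers: Buffers) -> Buffers:
    """Every rank ends up with the element-wise sum over all ranks."""
    reduced = [sum(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]


before = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # three ranks, three elements each
print("broadcast :", broadcast(before))
print("all_gather:", all_gather([[1], [2], [3]]))
print("all_to_all:", all_to_all(before))
print("all_reduce:", all_reduce(before))
```

Here All-Reduce uses summation as the reduction operator, which is the common case for gradient aggregation; other operators such as max or min would follow the same pattern.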
