Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

Sahil Tyagi; Martin Swany

TL;DR

The paper tackles the high communication cost of distributed DNN training under unpredictable networks by introducing AR-Topk, an Allreduce-compatible Top-k gradient compressor, along with STAR-Topk and VAR-Topk worker-selection variants. It couples this compressor with a flexible strategy that switches between Allgather and Allreduce based on an alpha-beta cost model and frames gradient compression as a multi-objective optimization over compression ratio $c$, compression time $t_{comp}$, and communication time $t_{sync}$, aiming to maximize compression gain while preserving convergence. A comprehensive empirical evaluation across ResNet, AlexNet, and ViT demonstrates that dynamic CR and adaptive collectives can outperform static schemes in heterogeneous network conditions, with STAR-Topk typically offering robust performance in moderate cluster sizes and VAR-Topk providing alternatives for unbalanced data or very large clusters. The proposed MOO framework enables adaptive compression that maintains DenseSGD-level accuracy while achieving notable speedups, ultimately delivering practical guidance for scalable training on edge, cloud, and HPC environments.
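The switch between Allgather and Allreduce can be illustrated with a minimal alpha-beta sketch, where `alpha` is per-message latency in seconds and `beta` is transfer time per byte. The formulas below are the textbook ring and tree collective costs, not the paper's own equations; the function names, the 4-byte element size, and the assumption that the AR path first broadcasts one worker's indices and then reduces only the values are illustrative choices.

```python
import math


def allgather_topk_cost(n_workers, n_params, cr, alpha, beta, bytes_per_elem=4):
    """Ring Allgather cost when every worker contributes k values plus k indices."""
    k = max(1, int(cr * n_params))
    payload = 2 * k * bytes_per_elem  # values + indices sent by each worker
    return (n_workers - 1) * alpha + (n_workers - 1) * payload * beta


def allreduce_topk_cost(n_workers, n_params, cr, alpha, beta, bytes_per_elem=4):
    """Tree broadcast of one worker's k indices, then ring Allreduce over k values.

    Assumption: this two-phase structure is a simplification for illustration.
    """
    k = max(1, int(cr * n_params))
    idx_bytes = k * bytes_per_elem
    val_bytes = k * bytes_per_elem
    bcast = math.ceil(math.log2(n_workers)) * (alpha + idx_bytes * beta)
    ring_ar = (2 * (n_workers - 1) * alpha
               + 2 * (n_workers - 1) / n_workers * val_bytes * beta)
    return bcast + ring_ar


def pick_collective(n_workers, n_params, cr, alpha, beta):
    """Return the name and estimated cost (seconds) of the cheaper collective."""
    ag = allgather_topk_cost(n_workers, n_params, cr, alpha, beta)
    ar = allreduce_topk_cost(n_workers, n_params, cr, alpha, beta)
    return ("allreduce", ar) if ar < ag else ("allgather", ag)


if __name__ == "__main__":
    # 5 ms latency and 1 Gbps bandwidth (8e-9 s per byte); model size is hypothetical.
    print(pick_collective(8, 25_000_000, 0.1, alpha=5e-3, beta=8e-9))
    print(pick_collective(2, 1_000, 0.01, alpha=5e-3, beta=8e-9))
```

Under these assumptions the bandwidth term dominates for large payloads and favors Allreduce, while for tiny payloads the extra latency rounds of the two-phase path make Allgather cheaper.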

Abstract

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling (i.e., parallel efficiency) due to high communication cost. Using lower CRs improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (i.e., statistical efficiency). Further, the speedup attained with different models and CRs varies with network latency, effective bandwidth, and the collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
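To make the idea of an AR-compatible Topk compressor concrete, the following is a minimal single-process sketch. It assumes one worker per step is chosen round-robin (in the spirit of STAR-Topk), that this worker shares its top-k indices, and that all workers then reduce only the values at those shared indices. The helper names and the plain-list stand-in for the collective are hypothetical, and the paper's actual algorithm may differ in its details.

```python
def topk_indices(grad, k):
    """Indices of the k largest-magnitude entries, returned in ascending order."""
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return sorted(ranked[:k])


def ar_topk_step(worker_grads, step, cr):
    """Simulate one aggregation step over a list of per-worker gradient lists.

    Because every worker contributes values at the SAME indices, the value
    vectors have equal length and alignment, so a sum-based Allreduce applies.
    """
    n_workers = len(worker_grads)
    size = len(worker_grads[0])
    k = max(1, int(cr * size))

    root = step % n_workers                      # round-robin selection (assumed)
    idx = topk_indices(worker_grads[root], k)    # indices the root would broadcast

    update = [0.0] * size
    for j in idx:
        # Stand-in for Allreduce: sum across workers, then average.
        update[j] = sum(g[j] for g in worker_grads) / n_workers
    return root, idx, update


if __name__ == "__main__":
    grads = [
        [0.1, -5.0, 0.2, 3.0],   # worker 0
        [4.0, 1.0, -0.3, 2.0],   # worker 1
    ]
    for step in range(2):
        print(ar_topk_step(grads, step, cr=0.5))
```

With Allgather, by contrast, each worker would send its own indices and values, so the received payload grows with the number of workers.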

Paper Structure (24 sections, 10 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Background And Related Work
Distributed Data-Parallel Training
Intra vs. Inter-node Scaling
Cost of Communication Collectives
Communication Scheduling
Statistical Efficiency of Data-Parallel Training
Gradient Compression to Mitigate Communication Cost
Parallel Efficiency in Gradient Compression
Latency and Bandwidth Variability
Statistical Efficiency in Gradient Compression
AR-Topk Compression
Allreduce-friendly Topk Compression (AR-Topk)
AR-Topk Worker Selection Mechanism
Training Speedup and Convergence in AR-Topk
...and 9 more sections

Figures (8)

Figure 1: (a) Intra-node computation-communication times for 8 workers. (b) Latency of gradient aggregation over 8 GPUs connected as: 8 GPUs/node vs. 1 GPU/node. Inter-node GPUs connected over 10Gbps network.
Figure 2: Compression overhead of different techniques. For the same CR, MSTopk has higher compression cost due to multi-round threshold estimation.
Figure 3: Compression gain measures statistical efficiency in lossy compressors.
Figure 4: Iteration density of the 8 workers (ranked 0-7) that broadcast their topk indices with STAR and VAR-Topk compression. STAR-Topk has similar density across all workers due to its round-robin nature, while some workers may have higher density of updates over others in VAR-Topk.
Figure 5: Communication time increases more steeply with $N$ in Allgather vs. AR-Topk CR 0.1. Network has an average 5ms latency and 1Gbps bandwidth.
...and 3 more figures
