GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Sahil Tyagi, Martin Swany

TL;DR

GraVAC targets the gradient-communication bottleneck in distributed data-parallel training by adapting the gradient compression factor (CF) during training using gradient-variance information. It introduces two metrics, compression gain and compression throughput, to jointly optimize parallel and statistical efficiency, which enables online, black-box CF adaptation without model-specific tuning. Using exponential and geometric scaling policies and a multi-level compression strategy, GraVAC matches dense SGD accuracy in the same number of iterations while cutting end-to-end training time by 4.32x, 1.95x and 6.67x for ResNet101, VGG16 and LSTM respectively relative to a static CF, and it delivers a 1.94x to 5.63x overall speedup over prior adaptive schemes such as Accordion. For large-scale DL deployments, this reduces communication overhead without sacrificing convergence; the gains are demonstrated on multi-GPU clusters and with multiple compression techniques.

Abstract

Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propose to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open-ended problem since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.
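
To make the trade-off behind the compression factor concrete, here is a minimal sketch under stated assumptions, not the paper's implementation. It assumes Top-k sparsification as the compressor and uses a squared-norm ratio as a stand-in for "gradient information loss"; the abstract does not define the exact metric GraVAC uses. The names `topk_compress` and `retained_energy` are hypothetical, and the gradient is synthetic Gaussian noise.

```python
import heapq
import random


def topk_compress(grad, cf):
    """Keep the len(grad)/cf largest-magnitude entries and zero the rest.

    Assumption: Top-k sparsification stands in for whichever compressor is used.
    """
    k = max(1, int(len(grad) / cf))
    keep = set(heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i])))
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def retained_energy(grad, compressed):
    """Squared-norm ratio of compressed to original gradient (1.0 = no loss).

    Assumption: an illustrative proxy, not the paper's definition of compression gain.
    """
    total = sum(g * g for g in grad)
    return sum(c * c for c in compressed) / total if total > 0 else 1.0


random.seed(0)
grad = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic gradient

# Higher CF sends fewer values but preserves less of the gradient.
ratios = {cf: retained_energy(grad, topk_compress(grad, cf)) for cf in (10, 100, 1000)}
for cf, ratio in ratios.items():
    print(f"CF {cf}x -> retained energy {ratio:.3f}")
```

The retained fraction shrinks as the CF grows, which is why a single static CF is hard to pick in advance: a low CF preserves the gradient but communicates more, while a high CF does the opposite.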

Paper Structure (18 sections, 8 equations, 11 figures, 4 tables, 1 algorithm)

Figures (11)

  • Figure 1: Communication overhead and early critical period in DDP training.
  • Figure 2: The CF with maximal speedup (to reach the targets in the models table) varies with each model and compression technique used. Results are normalized by the 10x CF; a speedup of 0.0 implies convergence failure.
  • Figure 3: Throughput and communication speedup for layerwise DGC compression, normalized by the 10x CF.
  • Figure 4: ResNet101: Prior- and post-compression gradients, test accuracy and compression gain for CFs 10x and 1000x.
  • Figure 5: VGG16: Prior- and post-compression gradients, test accuracy and compression gain for CFs 10x and 1000x.
  • ...and 6 more figures