GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training
Sahil Tyagi, Martin Swany
TL;DR
GraVAC addresses the gradient-communication bottleneck in distributed data-parallel training by adaptively selecting the gradient compression factor (CF) using gradient-variance information. It introduces two metrics, compression gain and compression throughput, to jointly optimize parallel and statistical efficiency, enabling online, black-box CF adaptation without model-specific tuning. Using Exponential and Geometric scaling policies and a multi-level compression strategy, GraVAC matches dense SGD accuracy in the same number of iterations while reducing end-to-end training time by 4.32x, 1.95x, and 6.67x for ResNet101, VGG16, and LSTM, respectively, relative to a static compression factor. It also outperforms prior adaptive schemes such as Accordion, with overall speedups of 1.94x to 5.63x over other adaptive schemes. The approach is practical for large-scale DL deployments because it reduces communication overhead without sacrificing convergence, with gains demonstrated on realistic multi-GPU clusters and across multiple compression techniques.
Abstract
Distributed data-parallel (DDP) training improves overall application throughput, as multiple devices each train on a subset of the data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, which is exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques have been proposed to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open problem, since it varies with the quality of compression, model size and structure, hardware, network topology, and bandwidth. We propose GraVAC, a framework that dynamically adjusts the compression factor throughout training by evaluating model progress and assessing the gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. Compared to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16, and LSTM by 4.32x, 1.95x, and 6.67x, respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.
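To make the idea of assessing gradient information loss concrete, the following is a minimal sketch and not the GraVAC implementation. It assumes Top-k sparsification as the compressor and measures information retained as the ratio of the squared norm of the compressed gradient to that of the original gradient; this ratio is used here as a stand-in for compression gain, and the exact definition in the paper may differ. The function names, the `threshold` parameter, and the simple "scale CF up while the gain stays above a threshold" rule are illustrative assumptions; the sketch omits compression throughput and multi-level compression entirely.

```python
def top_k_sparsify(grad, cf):
    """Keep roughly len(grad)/cf largest-magnitude entries; zero the rest."""
    k = max(1, int(len(grad) / cf))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def sq_norm(vec):
    """Squared L2 norm of a flat list of floats."""
    return sum(x * x for x in vec)


def compression_gain(grad, compressed):
    """Assumed proxy: fraction of squared gradient norm kept after compression."""
    denom = sq_norm(grad)
    return sq_norm(compressed) / denom if denom > 0 else 1.0


def choose_cf(grad, min_cf=10.0, max_cf=1000.0, step=10.0, threshold=0.9):
    """Hypothetical exponential policy: start at min_cf and multiply CF by
    `step` while the candidate CF keeps the gain at or above `threshold`."""
    cf = min_cf
    while cf * step <= max_cf:
        candidate = cf * step
        gain = compression_gain(grad, top_k_sparsify(grad, candidate))
        if gain < threshold:
            break  # candidate discards too much gradient information
        cf = candidate
    return cf


# A gradient dominated by a few large entries tolerates aggressive compression.
skewed = [10.0] * 5 + [0.01] * 995
# A gradient with uniform magnitudes loses information at any high CF.
uniform = [1.0] * 1000
cf_skewed = choose_cf(skewed)
cf_uniform = choose_cf(uniform)
```

With these toy inputs, the skewed gradient is scaled up to a CF of 100, while the uniform gradient stays at the minimum CF of 10, which illustrates why a single static CF is unlikely to suit every model or every phase of training.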
