AdaGossip: Adaptive Consensus Step-size for Decentralized Deep Learning with Communication Compression
Sai Aparna Aketi, Abolfazl Hashemi, Kaushik Roy
TL;DR
AdaGossip addresses the communication bottleneck in decentralized learning with compressed communication by introducing adaptive, per-parameter consensus step-sizes driven by the observed gossip-error. The method maintains a second-moment estimate $u_i^t$ of the gossip-error and uses it to set $\gamma_i^t=\dfrac{\gamma}{\sqrt{u_i^t}+\epsilon}$, so that each parameter is averaged at its own rate. Extending this to AdaG-SGD, the authors demonstrate consistent improvements (approximately 0.1–2% in test accuracy) over CHOCO-SGD across datasets (CIFAR-{10,100}, ImageNet), architectures (ResNet, LeNet-5, MobileNet-V2), and topologies (ring, Dyck, torus) under various compression regimes. The findings are relevant to edge-device training, where communication is costly, and offer a robust way to combine communication compression with convergence in decentralized settings.
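The step-size rule above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the exponential-moving-average form of the second-moment estimate, the decay `beta`, the default values, and the function names `adaptive_consensus_step` and `gossip_update` are all assumptions made for illustration. Only the formula $\gamma_i^t=\gamma/(\sqrt{u_i^t}+\epsilon)$ comes from the text.

```python
import math

def adaptive_consensus_step(u, gossip_error, gamma=0.1, beta=0.999, eps=1e-8):
    """Update the second-moment estimate and return per-parameter step-sizes.

    u            -- running second-moment estimate of the gossip-error (one float per parameter)
    gossip_error -- current gossip-error, one float per parameter
    gamma, eps   -- base consensus step-size and stabilizer from the formula above
    beta         -- ASSUMED decay rate of the moving average (hypothetical default)
    """
    # ASSUMPTION: the second moment is tracked as an exponential moving
    # average of the squared gossip-error, in the style of Adam/RMSProp.
    new_u = [beta * u_k + (1.0 - beta) * g_k * g_k
             for u_k, g_k in zip(u, gossip_error)]
    # Per-parameter consensus step-size: gamma / (sqrt(u) + eps).
    step_sizes = [gamma / (math.sqrt(u_k) + eps) for u_k in new_u]
    return new_u, step_sizes

def gossip_update(params, gossip_error, step_sizes):
    """ASSUMED averaging step: move each parameter along its gossip-error,
    scaled by its own consensus step-size."""
    return [x_k + s_k * g_k
            for x_k, s_k, g_k in zip(params, step_sizes, gossip_error)]

# Usage: two parameters, the second with a larger gossip-error.
u, steps = adaptive_consensus_step([0.0, 0.0], [1.0, 2.0], beta=0.5)
params = gossip_update([1.0, 1.0], [1.0, 2.0], steps)
```

In this sketch a parameter with a larger accumulated gossip-error receives a smaller consensus step-size, which is the parameter-wise adjustment the summary describes.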
Abstract
Decentralized learning is crucial in supporting on-device learning over large distributed datasets, eliminating the need for a central server. However, the communication overhead remains a major bottleneck for the practical realization of such decentralized setups. To tackle this issue, several algorithms for decentralized training with compressed communication have been proposed in the literature. Most of these algorithms introduce an additional hyper-parameter, referred to as the consensus step-size, which is tuned based on the compression ratio at the beginning of training. In this work, we propose AdaGossip, a novel technique that adaptively adjusts the consensus step-size based on the compressed model differences between neighboring agents. We demonstrate the effectiveness of the proposed method through an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, Imagenette, and ImageNet), model architectures, and network topologies. Our experiments show that the proposed method achieves superior performance ($0$–$2\%$ improvement in test accuracy) compared to the current state-of-the-art method for decentralized learning with communication compression.
