SADDLe: Sharpness-Aware Decentralized Deep Learning with Heterogeneous Data
Sakshi Choudhary, Sai Aparna Aketi, Kaushik Roy
TL;DR
SADDLe introduces sharpness-aware decentralized deep learning to tackle non-IID data and communication costs in peer-to-peer settings. By integrating Sharpness-Aware Minimization (SAM) into local updates, SADDLe seeks flatter loss landscapes, which improves generalization and robustness to compression. The framework yields two variants, Q-SADDLe and N-SADDLe, which augment existing decentralized methods with either a global momentum buffer or cross-gradient information, respectively. Their theoretical convergence guarantees align with established decentralized rates. Extensive experiments across varied datasets, models, graphs, and compression schemes show consistent 1–20% test accuracy gains, and up to 4× communication compression costs only about 1% in accuracy, highlighting practical impact for on-device and federated-style learning in heterogeneous environments.
Abstract
Decentralized training enables learning with distributed datasets generated at different locations without relying on a central server. In realistic scenarios, the data distribution across these sparsely connected learning agents can be significantly heterogeneous, leading to local model over-fitting and poor global model generalization. Another challenge is the high communication cost of training models in such a peer-to-peer fashion without any central coordination. In this paper, we jointly tackle these two practical challenges by proposing SADDLe, a set of sharpness-aware decentralized deep learning algorithms. SADDLe leverages Sharpness-Aware Minimization (SAM) to seek a flatter loss landscape during training, resulting in better model generalization as well as enhanced robustness to communication compression. We present two versions of our approach and conduct extensive experiments to show that SADDLe leads to a 1–20% improvement in test accuracy compared to other existing techniques. Additionally, our proposed approach is robust to communication compression, with an average accuracy drop of only 1% in the presence of up to 4× compression.
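To make the SAM ingredient concrete, the following is a minimal sketch of one generic SAM update on a single agent, not the SADDLe algorithm itself. It omits the decentralized parts (gossip averaging with neighbors, momentum or cross-gradient terms, compression). The helper names `sam_step` and `quad_grad`, the toy quadratic loss, and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic SAM update on a list of parameters (illustrative helper)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction to perturb along
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the
    # original parameters.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def quad_grad(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2) (assumed)."""
    return [w[0], 10.0 * w[1]]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, quad_grad)
```

Because the descent gradient is evaluated at the perturbed point, the iterate on this toy problem settles in a small neighborhood of the minimum whose size scales with `rho`, rather than converging to it exactly.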
