SADDLe: Sharpness-Aware Decentralized Deep Learning with Heterogeneous Data

Sakshi Choudhary, Sai Aparna Aketi, Kaushik Roy

TL;DR

SADDLe introduces sharpness-aware decentralized deep learning to tackle non-IID data and communication costs in peer-to-peer settings. By integrating Sharpness-Aware Minimization (SAM) into local updates, SADDLe seeks flatter loss landscapes, which improves generalization and robustness to compression. The framework yields two variants, Q-SADDLe and N-SADDLe, which augment existing decentralized methods with a global momentum buffer and cross-gradient information, respectively. The theoretical convergence guarantees align with established decentralized rates. Extensive experiments across varied datasets, models, graphs, and compression schemes show consistent 1–20% test accuracy gains and tolerance of up to 4× compression with only about 1% accuracy loss, highlighting practical impact for on-device and federated-style learning in heterogeneous environments.

Abstract

Decentralized training enables learning with distributed datasets generated at different locations without relying on a central server. In realistic scenarios, the data distribution across these sparsely connected learning agents can be significantly heterogeneous, leading to local model over-fitting and poor global model generalization. Another challenge is the high communication cost of training models in such a peer-to-peer fashion without any central coordination. In this paper, we jointly tackle these two-fold practical challenges by proposing SADDLe, a set of sharpness-aware decentralized deep learning algorithms. SADDLe leverages Sharpness-Aware Minimization (SAM) to seek a flatter loss landscape during training, resulting in better model generalization as well as enhanced robustness to communication compression. We present two versions of our approach and conduct extensive experiments to show that SADDLe leads to 1-20% improvement in test accuracy compared to other existing techniques. Additionally, our proposed approach is robust to communication compression, with an average drop of only 1% in the presence of up to 4x compression.
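
As a rough illustration of the SAM update that the abstract refers to, the sketch below applies the generic two-step SAM rule (perturb the weights along the normalized gradient, then descend using the gradient at the perturbed point) to a toy quadratic loss. It is not the paper's Q-SADDLe or N-SADDLe algorithm: the names `sam_step`, `quad_grad`, `quad_loss`, the curvature values `CURV`, and the hyperparameters `lr` and `rho` are assumptions chosen for this example, and no decentralized communication is modeled.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic SAM update on a list of parameters `w` (illustrative only)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: no perturbation direction exists
    # Ascent step: move to the approximate worst-case point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Hypothetical toy loss f(w) = 0.5 * sum(c_i * w_i^2) with one sharp direction.
CURV = [1.0, 10.0]


def quad_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURV, w))


def quad_grad(w):
    return [c * wi for c, wi in zip(CURV, w)]


w = [1.0, 1.0]
initial_loss = quad_loss(w)
for _ in range(50):
    w = sam_step(w, quad_grad, lr=0.05, rho=0.05)
final_loss = quad_loss(w)
```

In a decentralized setting each agent would presumably run a step like this on its local data before exchanging information with its neighbors; that exchange is omitted here.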

Paper Structure

This paper contains 28 sections, 7 theorems, 52 equations, 9 figures, 16 tables, 5 algorithms.

Key Result

Theorem 1

Given Assumptions 1-3, for momentum coefficients $\beta$ and $\mu$, let the learning rate satisfy $\eta \leq \min \left(\frac{\lambda}{7 L}, \frac{1-\beta}{4L} , \frac{(1-\beta)^2(1-\mu)}{\sqrt{12}L\beta}\right)$. For all $T \geq 1$, we have where $C_1= \frac{(2-\beta-\mu)(1-\beta)^2}{(1-\mu)}$, $C_2=\frac{\beta^2}{(1-\mu)(1-\beta)}$, $\Bar{x}$ is the average/consensus model and $\tilde{\eta}=

Figures (9)

  • Figure 1: Loss landscape visualization for QGM (surface) vs Q-SADDLe (mesh) and Comp QGM (surface) vs Comp Q-SADDLe (mesh) for ResNet-20 trained on CIFAR-10 with non-IID data across 10 agents. Comp signifies communication compression through 8-bit stochastic quantization.
  • Figure 2: Impact of flatness on (a) Compression Error and (b) Model Updates for ResNet-20 trained on CIFAR-10 distributed in a non-IID manner across a 10 agent ring topology.
  • Figure 3: Ring Graph (left), and Torus Graph (right).
  • Figure 4: Test accuracy for different levels of quantization-based compression scheme for CIFAR-10 over a 10 agent ring topology.
  • Figure 5: Largest Eigenvalue of the Hessian ($\lambda_{\max}$) at 3 stages of training for ResNet-20 trained on CIFAR-10 in a 10 agent ring topology with $\alpha = 0.001$.
  • ...and 4 more figures
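
The caption of Figure 1 mentions 8-bit stochastic quantization as the compression scheme. The exact quantizer used in the paper is not reproduced on this page, so the following is only a minimal sketch of one common form of stochastic (randomized-rounding) uniform quantization. The function name `stochastic_quantize` and the min/max scaling are assumptions made for illustration.

```python
import random


def stochastic_quantize(values, bits=8, rng=random):
    """Quantize floats onto a uniform grid of 2**bits levels with randomized rounding.

    Illustrative sketch only; not necessarily the quantizer used in the paper.
    """
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    scale = (hi - lo) / levels
    out = []
    for v in values:
        pos = (v - lo) / scale        # position on the grid, in [0, levels]
        idx = min(int(pos), levels)   # lower grid index (pos is non-negative)
        # Round up with probability equal to the fractional part, so the
        # quantized value equals v in expectation (unbiased rounding).
        if idx < levels and rng.random() < pos - idx:
            idx += 1
        out.append(lo + idx * scale)
    return out


random.seed(0)
vals = [0.0, 0.1, 0.5, 0.9, 1.0]
quantized = stochastic_quantize(vals, bits=8)
```

Each quantized entry differs from its input by at most one grid step, `(hi - lo) / (2**bits - 1)`. In practice only the integer indices and the two range endpoints would need to be transmitted.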

Theorems & Definitions (7)

  • Theorem 1
  • Corollary 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7