Gradient Compression May Hurt Generalization: A Remedy by Synthetic Data Guided Sharpness Aware Minimization

Yujie Gu; Richeng Jin; Zhaoyang Zhang; Huaiyu Dai

TL;DR

This work reveals that gradient compression in federated learning can sharpen loss landscapes and degrade generalization under non-IID data. It proposes FedSynSAM, a SAM-based FL method that uses a trajectory-derived synthetic dataset to more accurately estimate the global perturbation, addressing a key limitation of prior SAM-based FL approaches under compression. The authors provide convergence guarantees for unbiased compressors and demonstrate through extensive experiments that FedSynSAM yields superior accuracy and flatter loss landscapes across multiple datasets and compression schemes, with notable robustness to hyperparameters. The approach offers a practical path to improved generalization in bandwidth-constrained, heterogeneous FL settings thanks to trajectory-inspired data synthesis and SAM integration.
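To make "unbiased compressor" concrete, here is a minimal sketch of the two compressor families named in the Figure 1 caption: a QSGD-style stochastic quantizer and Top-k sparsification. The function names, the `bits`-to-levels mapping (`2**bits - 1` levels), and the `rng` parameter are illustrative assumptions, not the paper's implementation. The quantizer is unbiased because random rounding makes the expected output equal the input; Top-k is a biased compressor, since it always zeroes the same small entries.

```python
import math
import random

def stochastic_quantize(v, bits=4, rng=random):
    """QSGD-style quantizer (sketch): randomly round |v_i| / ||v|| onto s levels.

    Assumption: s = 2**bits - 1 levels; the paper's exact mapping may differ.
    """
    s = 2 ** bits - 1
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for x in v:
        scaled = abs(x) / norm * s          # position on the level grid, in [0, s]
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part, so that
        # the expected level equals `scaled` (this is what makes it unbiased).
        level = low + (1 if rng.random() < scaled - low else 0)
        out.append(math.copysign(norm * level / s, x))
    return out

def top_k(v, k):
    """Keep the k largest-magnitude entries and zero the rest (biased)."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]
```

Averaging many draws of `stochastic_quantize(v)` approaches `v`, whereas `top_k(v, k)` returns the same truncated vector every time.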

Abstract

It is commonly believed that gradient compression in federated learning (FL) significantly improves communication efficiency with negligible performance degradation. In this paper, we find that gradient compression induces sharper loss landscapes in federated learning, particularly under non-IID data distributions, which suggests hindered generalization capability. The recently emerging Sharpness Aware Minimization (SAM) effectively searches for flat minima by incorporating a gradient ascent step (i.e., perturbing the model with gradients) before the usual stochastic gradient descent step. Nonetheless, the direct application of SAM in FL suffers from inaccurate estimation of the global perturbation due to data heterogeneity. Existing approaches use the model update from the previous communication round as a rough estimate, but this estimate becomes less effective once model update compression is incorporated. We propose FedSynSAM, which leverages the global model trajectory to construct synthetic data and thereby obtain an accurate estimate of the global perturbation. The convergence of the proposed algorithm is established, and extensive experiments are conducted to validate its effectiveness.
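The ascent-then-descent structure of SAM described above can be sketched in a few lines. This is a generic single-machine SAM step under stated assumptions, not FedSynSAM itself: `grad_fn`, `toy_grad`, the learning rate `lr`, and the toy quadratic loss are hypothetical names and values chosen for illustration, while the perturbation $\rho \, g / \|g\|$ follows the form given in Lemma 1.

```python
import math

def perturbation(grad, rho):
    """SAM perturbation: rho * grad / ||grad|| (zero vector if grad is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [rho * g / norm for g in grad]

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to w + eps, then descend with the gradient taken there."""
    eps = perturbation(grad_fn(w), rho)
    w_perturbed = [wi + ei for wi, ei in zip(w, eps)]
    g_perturbed = grad_fn(w_perturbed)
    return [wi - lr * gi for wi, gi in zip(w, g_perturbed)]

def toy_grad(w):
    """Gradient of the illustrative loss F(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]

w_next = sam_step([1.0, 1.0], toy_grad)
```

In the federated setting the difficulty is that each client only has its local gradient available for `perturbation`, which is the estimation problem the abstract refers to.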

Paper Structure (20 sections, 15 theorems, 48 equations, 5 figures, 7 tables, 1 algorithm)


Introduction
Related Work
Preliminaries
Communication-Efficient Federated Learning
Sharpness Aware Minimization (SAM)
Proposed Algorithm: FedSynSAM
Rethinking Federated Learning with Gradient Compression
Accurate Estimation of Global Perturbation via Synthetic Dataset
Overall Algorithm of FedSynSAM
Convergence Analysis
Experiments
Experimental Setups
Main Results
Ablation Study
Conclusion
...and 5 more sections

Key Result

Lemma 1

(Bounded deviation of $\|\nabla F_i (\bm{w} + \hat{\bm{\epsilon}}_i ) - \nabla F(\bm{w} + \bm{\epsilon})\|^2$.) The deviation of local and global gradients with perturbation in FedSynSAM can be bounded as follows: where $\hat{\bm{\epsilon}}_i=\rho \frac{\nabla \hat{F}_i(\bm{w})}{\|\nabla \hat{F}_i(\bm{w})\|}, \bm{\epsilon}=\rho \frac{\nabla F(\bm{w})}{\|\nabla F(\bm{w})\|}$, $\nabla \hat{F}_i(\bm

Figures (5)

Figure 1: Visualization of loss landscapes of FedAvg with and without model update compression. In each panel, the transparent purple loss surface marked by the "w/o comp" arrow corresponds to FedAvg without compression. The experiments are conducted on the Fashion-MNIST dataset with stochastic quantization [alistairh2017qsgd] and Top-$k$ sparsification [alistairh2018convergence]. The training data is allocated to 100 clients following the uniform distribution to simulate IID data, and following the Dirichlet distribution (Dir) [hsu2019measuring] to simulate data heterogeneity.
Figure 2: Cosine similarity between the true and estimated global perturbation in FedLESAM [fan2024locally] and FedSynSAM. The experiments are conducted on the CIFAR-10 dataset with 4-bit stochastic quantization [alistairh2017qsgd]. We simulate data heterogeneity with the Dirichlet (Dir) and Pathological (Path) distributions [hsu2019measuring].
Figure 3: Comparison of test accuracy of training ConvNet on CIFAR-10 under the Path(1) non-IID setting with different gradient compressors.
Figure 4: Loss landscape of FedAvg, FedSAM, FedLESAM, and FedSynSAM with 4-bit stochastic quantization under the Path(1) data distribution on CIFAR-10 with full participation of 10 clients.
Figure 5: Impact of perturbation radius $\rho$ on CIFAR-10 with partial participation and no compression.
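The metric reported in Figure 2 can be sketched as follows. This is a minimal illustration assuming perturbations are plain lists of floats; the function name and the zero-vector convention are assumptions made here, not details from the paper. Because a SAM perturbation is a gradient rescaled to length $\rho$, the cosine similarity between two perturbations equals the cosine similarity between the underlying gradients.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b.

    Convention assumed here: return 0.0 if either vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical example: a true global direction and a client-side estimate.
true_direction = [1.0, 0.0]
estimated_direction = [1.0, 1.0]
similarity = cosine_similarity(true_direction, estimated_direction)
```

A value near 1 means the estimated perturbation points almost exactly along the true global one; values near 0 mean the estimate carries little information about the true direction.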

Theorems & Definitions (19)

Lemma 1
Remark 1
Theorem 1
Theorem 2
Remark 2
Remark 3
Remark 4
Lemma 2
Lemma 3
Lemma 4
...and 9 more
