Gradient Correction in Federated Learning with Adaptive Optimization

Evan Chen, Shiqiang Wang, Jianing Zhang, Dong-Jun Han, Chaoyue Liu, Christopher Brinton

TL;DR

This paper addresses the detrimental effect of data heterogeneity in federated learning when using adaptive optimizers like Adam. It introduces FAdamGC, a gradient-corrected Adam algorithm that injects a pre-estimation drift correction into the local updates and employs gradient-level correction buffers, with selective tracking to reduce communication. The authors provide convergence guarantees for non-convex objectives and show linear speedup under mild assumptions, highlighting improvements over naively porting SGD-based corrections. Empirical results across image and language tasks demonstrate that FAdamGC achieves faster convergence and better communication efficiency than baselines, particularly in highly non-i.i.d. settings. The work advances robust, efficient federated optimization by integrating gradient correction with adaptive methods, though it relies on bounded gradient assumptions that future work may relax.

Abstract

In federated learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Client-drift compensation methods have recently emerged as a solution to this issue, introducing correction terms into local model updates. To date, these methods have only been considered under stochastic gradient descent (SGD)-based model training, while modern FL frameworks also employ adaptive optimizers (e.g., Adam) for improved convergence. However, due to the complex interplay between first and second moments found in most adaptive optimization methods, naively injecting correction terms can lead to performance degradation in heterogeneous settings. In this work, we propose FAdamGC, the first algorithm to integrate drift compensation into adaptive federated optimization. The key idea of FAdamGC is to inject a pre-estimation correction term that aligns with the moment structure of adaptive methods. We provide a rigorous convergence analysis of our algorithm under non-convex settings, showing that FAdamGC achieves a better rate under milder assumptions than naively porting SGD-based correction algorithms into adaptive optimizers. Our experimental results demonstrate that FAdamGC consistently outperforms existing methods in total communication and computation cost across varying levels of data heterogeneity, showing the efficacy of correcting gradient information in federated adaptive optimization.
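
To make the difference between a pre-estimation correction and a naive one concrete, here is a minimal sketch of a single local Adam-style step on a toy problem. It is not the paper's algorithm. It assumes that "pre-estimation" means the correction is added to the gradient before the first and second moments are updated, and that the naive variant adds the correction to the update direction after the moments are computed. The function name `adam_local_step`, the quadratic client losses, and the correction term (global average gradient minus the client's own gradient, in the style of SGD-based drift-correction methods) are all hypothetical choices for illustration, and bias correction is omitted for brevity.

```python
import numpy as np


def adam_local_step(x, m, v, grad, correction, lr=0.01,
                    beta1=0.9, beta2=0.999, eps=1e-8, pre_estimation=True):
    """One Adam-style local step with a drift-correction term (illustrative only).

    pre_estimation=True  : correction is added to the gradient BEFORE the
                           moment estimates are updated (assumed reading of
                           "pre-estimation correction").
    pre_estimation=False : moments are built from the raw local gradient and
                           the correction is added to the final direction
                           (a naive port of an SGD-style correction).
    """
    g = grad + correction if pre_estimation else grad
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    direction = m / (np.sqrt(v) + eps)
    if not pre_estimation:
        direction = direction + correction
    return x - lr * direction, m, v


def demo(pre_estimation):
    """Two clients with losses 0.5 * (x - a_i)^2; the global optimum is mean(a)."""
    targets = np.array([-1.0, 3.0])            # heterogeneous local optima
    x0 = 0.0
    local_grads = x0 - targets                 # gradient of each client at x0
    global_grad = local_grads.mean()
    new_points = []
    for g_i in local_grads:
        correction = global_grad - g_i         # hypothetical drift correction
        x_new, _, _ = adam_local_step(x0, 0.0, 0.0, g_i, correction,
                                      pre_estimation=pre_estimation)
        new_points.append(x_new)
    return np.array(new_points)


pre = demo(pre_estimation=True)
naive = demo(pre_estimation=False)
print("pre-estimation correction:", pre)
print("naive correction:         ", naive)
```

In this toy setting the corrected gradient is the same for both clients, so with the pre-estimation variant both clients take the same step toward the global optimum at $x = 1$. With the naive variant the moments are normalized from the raw, heterogeneous gradients, and the two clients move in opposite directions.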

Paper Structure

This paper contains 17 sections, 8 theorems, 78 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Theorem 5.2

Let $\beta_2 = \epsilon = 0$. By selecting $\eta_g\eta_l = \min\left\{\frac{\sqrt{\mathcal{F}n}}{\sqrt{\sigma^2 KTL}}, \frac{\mathcal{F}}{T}\right\}$, $\beta = \sqrt[K]{\frac{KN - 2T}{2KN}}$, and $\eta_l \leq \frac{1}{T}$, under Assumption assump:genLoss, the iterates of FAdamGC can be bounded as:
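
The hyperparameter choices in the statement of Theorem 5.2 are closed-form expressions, so they can be evaluated directly for a given configuration. The sketch below does only that arithmetic. The function name `theorem_5_2_parameters` and the argument names are hypothetical, the symbols are passed through as plain numbers without interpreting them, and since the statement uses both $n$ and $N$ the sketch keeps them as separate arguments. Rejecting the case $KN \leq 2T$ is an assumption of this sketch, made so that $\beta$ stays strictly positive.

```python
import math


def theorem_5_2_parameters(F, n, sigma2, K, T, L, N):
    """Evaluate the parameter selections from the statement of Theorem 5.2.

    Returns (eta_g * eta_l, beta, upper bound on eta_l).
    All arguments are plain numbers; their meaning follows the paper's notation.
    """
    if K * N <= 2 * T:
        # Assumption of this sketch: require K*N > 2*T so that beta > 0.
        raise ValueError("need K*N > 2*T for a strictly positive beta")
    eta_product = min(math.sqrt(F * n) / math.sqrt(sigma2 * K * T * L), F / T)
    beta = ((K * N - 2 * T) / (2 * K * N)) ** (1.0 / K)
    eta_l_upper = 1.0 / T
    return eta_product, beta, eta_l_upper


# Example values chosen only to exercise the formulas.
eta_product, beta, eta_l_upper = theorem_5_2_parameters(
    F=1.0, n=4, sigma2=1.0, K=2, T=4, L=1.0, N=10)
print(eta_product, beta, eta_l_upper)
```

Because $\frac{KN - 2T}{2KN} < \frac{1}{2}$ whenever $T > 0$, the resulting $\beta$ is always below 1 in this formula.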

Figures (9)

  • Figure 1: Visualization of the local update process under adaptive optimization with gradient correction. While adaptive methods help smooth the optimization trajectory, clients may still drift toward local optima due to data heterogeneity, preventing them from reaching globally optimal solutions even with federated cooperation. Gradient correction steers updates toward the global objective, mitigating client-drift to stabilize training. This combination blends the fast convergence of adaptive optimizers and the stability of correction-based methods.
  • Figure 2: Comparison of achieved accuracy over global iterations and run time on CIFAR-100 and 20NewsGroups. FAdamGC consistently outperforms the baselines under both evaluation methods.
  • Figure 3: Comparison of the total cost of Adam-based methods under varying Dirichlet parameters on CIFAR-100 to attain $50\%$ accuracy.
  • Figure 4: Comparison of cost to attain certain accuracy between different tracking sampling rates on CIFAR-100 with $S = 50$, where the target accuracy is $50\%$.
  • Figure 5: Experimental results on CIFAR-100 under different client sampling rates with $K = 60$.
  • ...and 4 more figures

Theorems & Definitions (14)

  • Theorem 5.2
  • Theorem 5.4
  • proof
  • Lemma A.1
  • proof
  • Lemma B.1
  • proof
  • Theorem C.2
  • Theorem C.3
  • proof
  • ...and 4 more