Convergence Analysis of Federated Learning Methods Using Backward Error Analysis

Jinwoo Lim, Suhyun Kim, Soo-Mook Moon

TL;DR

This work extends backward error analysis to federated learning, deriving a modified loss (implicit regularizer) that governs the actual gradient flow under finite-step updates. The authors derive explicit forms for FedAvg, FedSAM, and SCAFFOLD, revealing a dispersion term that increases gradient variance and biases updates, and a second-order term that can modulate convergence, especially under multiple local epochs. Empirical results on MNIST, FEMNIST, and CIFAR-10 validate the theory: FedAvg suffers from dispersion-induced bias and sharper minima, while FedSAM and SCAFFOLD mitigate dispersion to varying degrees, with high-order terms limiting variance-reduction benefits in complex models. Overall, the implicit-regularizer lens provides a complementary, intuition-rich perspective on convergence dynamics in non-IID federated learning and guides discussions on variance-reduction strategies and their limitations.

Abstract

Backward error analysis allows finding a modified loss function that the parameter updates actually follow under the influence of an optimization method. The additional loss terms included in this modified function are called the implicit regularizer. In this paper, we attempt to find the implicit regularizer for various federated learning algorithms on non-IID data distributions, and explain why each method shows different convergence behavior. We first show that the implicit regularizer of FedAvg disperses the gradient of each client from the average gradient, thus increasing the gradient variance. We also empirically show that the implicit regularizer hampers its convergence. Similarly, we compute the implicit regularizers of FedSAM and SCAFFOLD, and explain why they converge better. While existing convergence analyses focus on pointing out the advantages of FedSAM and SCAFFOLD, our approach can explain their limitations in complex non-convex settings. Specifically, we demonstrate that FedSAM can partially remove the bias in the first-order term of the implicit regularizer in FedAvg, whereas SCAFFOLD can fully eliminate the bias in the first-order term, but not in the second-order term. Consequently, the implicit regularizer can provide useful insight into the convergence behavior of federated learning from a different theoretical perspective.
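To make the idea of a modified loss concrete, the following minimal sketch checks it numerically for plain (non-federated) gradient descent on a one-dimensional quadratic. It assumes the standard first-order result for gradient descent, a modified loss of the form $L(w) + \frac{\eta}{4} L'(w)^2$; this form is not taken from this page, and it is not the federated modified loss derived in the paper. The loss $L(w) = \frac{1}{2} a w^2$, the values of `a` and `eta`, and all function names are illustrative assumptions.

```python
import math

def gd_step(w, a, eta):
    """One discrete gradient-descent step on L(w) = 0.5 * a * w**2."""
    return w - eta * a * w

def flow_original(w, a, eta):
    """Exact gradient flow of L for time eta (dw/dt = -a * w)."""
    return w * math.exp(-eta * a)

def flow_modified(w, a, eta):
    """Exact gradient flow for time eta of the assumed modified loss
    L_mod(w) = L(w) + (eta / 4) * L'(w)**2.
    Its gradient is a * (1 + eta * a / 2) * w, so the flow is still exponential."""
    return w * math.exp(-eta * a * (1.0 + 0.5 * eta * a))

# Illustrative values, not taken from the paper.
w0, a, eta = 1.0, 1.0, 0.1

w_discrete = gd_step(w0, a, eta)
err_original = abs(w_discrete - flow_original(w0, a, eta))
err_modified = abs(w_discrete - flow_modified(w0, a, eta))

print("distance to flow of original loss:", err_original)
print("distance to flow of modified loss:", err_modified)
```

In this toy setting the discrete step lands closer to the continuous flow of the modified loss than to the flow of the original loss, which is the sense in which the updates "actually follow" the modified loss.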

Paper Structure

This paper contains 42 sections, 4 theorems, 51 equations, 6 figures, 3 algorithms.

Key Result

Theorem 1

If the local parameters of clients are updated in discrete steps with a finite learning rate, the expectation of the discrete updates of the aggregated parameter in FedAvg follows the modified loss $\tilde{\mathcal{L}}_{FedAvg} (\omega)$, whose explicit expression is omitted here. The approximation in that expression holds when $\eta \ll 1 / E$. If $E=1$, the modified loss is the same as that of SGD.
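The $E=1$ case can be illustrated at the level of a single update, without reproducing the modified loss itself. The sketch below is a toy setup under stated assumptions: quadratic client losses, full-batch client gradients, equal client weights, full participation, and $E$ interpreted as the number of local gradient steps (the paper may define $E$ differently, for example as local epochs). It compares one FedAvg round with the same number of centralized gradient steps on the averaged loss. All names (`client_grad`, `fedavg_round`, `central_step`) and constants are hypothetical.

```python
import numpy as np

def client_grad(w, A, c):
    """Gradient of the toy client loss 0.5 * (w - c)^T A (w - c)."""
    return A @ (w - c)

def fedavg_round(w, clients, eta, E):
    """One FedAvg round: every client takes E local gradient steps from w,
    then the server averages the resulting local models."""
    local_models = []
    for A, c in clients:
        w_k = w.copy()
        for _ in range(E):
            w_k = w_k - eta * client_grad(w_k, A, c)
        local_models.append(w_k)
    return np.mean(local_models, axis=0)

def central_step(w, clients, eta):
    """One gradient step on the average of the client losses."""
    g = np.mean([client_grad(w, A, c) for A, c in clients], axis=0)
    return w - eta * g

# Heterogeneous (non-IID-like) clients: different curvatures and minimizers.
rng = np.random.default_rng(0)
clients = [(np.diag(rng.uniform(0.5, 2.0, size=3)), rng.normal(size=3))
           for _ in range(4)]
w0 = rng.normal(size=3)
eta = 0.05

# E = 1: the averaged local update coincides with one centralized step.
gap_E1 = np.linalg.norm(fedavg_round(w0, clients, eta, 1)
                        - central_step(w0, clients, eta))

# E = 5: compare one FedAvg round with 5 centralized steps.
w_c = w0.copy()
for _ in range(5):
    w_c = central_step(w_c, clients, eta)
gap_E5 = np.linalg.norm(fedavg_round(w0, clients, eta, 5) - w_c)

print("gap for E=1:", gap_E1)
print("gap for E=5:", gap_E5)
```

With one local step the two updates agree up to floating-point error, while several local steps on heterogeneous clients open a gap. This only shows the update-level equivalence behind the $E=1$ statement; it does not derive or verify the modified loss of the theorem.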

Figures (6)

  • Figure 1: Test accuracy and variance of client gradients of FedAvg, SCAFFOLD, and SGD on MNIST and FEMNIST. The final test accuracy is higher and the variance of client gradients is significantly lower when the dispersion term is absent in the modified loss. The convergence behaviours of SCAFFOLD, SGD, and FedAvg without dispersion term are almost identical.
  • Figure 2: Variance of mini-batch gradients in MNIST and FEMNIST.
  • Figure 3: The value of $\varepsilon$ of FedSAM on MNIST and Fashion-MNIST. $\varepsilon$ is consistently lower than $E\eta / 2$.
  • Figure 4: Test accuracy, value of $\varepsilon$, and client gradient variance of FedSAM on MNIST and Fashion-MNIST. $\varepsilon$ was switched to $E \eta / 2$ during training and the convergence speed became faster while the variance of mini-batch gradients decreased. The values at the exact round where switching occurred were omitted for smoothing of the graphs.
  • Figure 5: Test accuracy and client gradient variance for a complex model and dataset. Compared to SGD, which has no dispersion terms, FedAvg without the first-order dispersion term has lower accuracy but lower variance, while SCAFFOLD has slightly lower accuracy, due to higher-order terms.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Corollary 4