Stochastic-Sign SGD for Federated Learning with Theoretical Guarantees

Richeng Jin, Yufan Huang, Xiaofan He, Huaiyu Dai, Tianfu Wu

TL;DR

The paper tackles federated learning under practical constraints by introducing Stochastic-Sign SGD, which uses stochastic gradient compressors to enable convergence despite data heterogeneity, while also supporting differential privacy via dp-sign. It provides rigorous theoretical guarantees for convergence to a neighborhood of the optimum, quantifies Byzantine resilience, and proposes enhancements such as weighted voting and Top-K sparsification to boost robustness and privacy-accuracy trade-offs. An error-feedback variant further improves learning by compensating compression-induced errors, with extended results to SGD and privacy-preserving settings. Empirical validation on MNIST and CIFAR-10 demonstrates competitive accuracy under communication constraints and resilience scenarios, highlighting the method’s practicality for large-scale, heterogeneous, and potentially adversarial FL deployments.

Abstract

Federated learning (FL) has emerged as a prominent distributed learning paradigm. FL entails some pressing needs for developing novel parameter estimation approaches with theoretical guarantees of convergence, which are also communication efficient, differentially private and Byzantine resilient in the heterogeneous data distribution settings. Quantization-based SGD solvers have been widely adopted in FL and the recently proposed SIGNSGD with majority vote shows a promising direction. However, no existing methods enjoy all the aforementioned properties. In this paper, we propose an intuitively-simple yet theoretically-sound method based on SIGNSGD to bridge the gap. We present Stochastic-Sign SGD which utilizes novel stochastic-sign based gradient compressors enabling the aforementioned properties in a unified framework. We also present an error-feedback variant of the proposed Stochastic-Sign SGD which further improves the learning performance in FL. We test the proposed method with extensive experiments using deep neural networks on the MNIST dataset and the CIFAR-10 dataset. The experimental results corroborate the effectiveness of the proposed method.
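As a rough illustration of the kind of stochastic-sign compressor the abstract refers to, the sketch below assumes a particular form that the excerpt does not state: each gradient coordinate $g_i$ is mapped to $+1$ with probability $(b+g_i)/(2b)$ and to $-1$ otherwise, for a scalar $b \geq \max_i |g_i|$. Under that assumption the expected output is $g_i/b$, so the sign votes retain magnitude information that a deterministic sign discards. The function names `sto_sign` and `majority_vote` are hypothetical, and the server rule shown is a plain coordinate-wise majority vote.

```python
import numpy as np

def sto_sign(grad, b, rng):
    """Assumed stochastic one-bit compressor.

    Each coordinate becomes +1 with probability (b + g) / (2b), else -1.
    Requires b >= max |g| so that the probabilities lie in [0, 1].
    """
    grad = np.asarray(grad, dtype=float)
    if b < np.max(np.abs(grad)):
        raise ValueError("b must be at least max |grad_i|")
    p_plus = (b + grad) / (2.0 * b)
    return np.where(rng.random(grad.shape) < p_plus, 1.0, -1.0)

def majority_vote(signs):
    """Coordinate-wise majority vote over workers (rows); ties give 0."""
    return np.sign(np.sum(signs, axis=0))

# Three workers with heterogeneous gradients for a single coordinate.
# The true mean gradient is positive, but two of three raw signs are negative.
rng = np.random.default_rng(0)
worker_grads = np.array([[3.0], [-1.0], [-1.0]])
votes = np.stack([sto_sign(g, b=3.0, rng=rng) for g in worker_grads])
update_direction = majority_vote(votes)
```

In this toy case a deterministic sign followed by majority vote always returns $-1$, although the mean gradient is $+1/3$. With the assumed compressor and $b=3$, the first worker always votes $+1$ and each of the other two votes $+1$ with probability $1/3$, so the vote is correct with probability $1-(2/3)^2 = 5/9$.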

Paper Structure

This paper contains 34 sections, 26 theorems, 104 equations, 5 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Let $u_{1},u_{2},\cdots,u_{M}$ be $M$ known and fixed real numbers and consider binary random variables $\hat{u}_{m}$, $1\leq m \leq M$. Suppose that $\bar{p} = \frac{1}{M}\sum_{m=1}^{M}\Pr\left(\operatorname{sign}\left(\frac{1}{M}\sum_{j=1}^{M}u_{j}\right) \neq \hat{u}_{m}\right) < \frac{1}{2}$, we have

Figures (5)

  • Figure 1: The first and second rows show the performance of Sto-SIGNSGD on MNIST and CIFAR-10, respectively. All results are averaged over 5 runs. The first column shows the training and testing accuracy of Sto-SIGNSGD for different $\boldsymbol{b}=b\cdot\boldsymbol{1}$, using 200 communication rounds for MNIST and 8,000 for CIFAR-10. The second column compares the testing accuracy of Sto-SIGNSGD with SIGNSGD and FedAvg (McMahan et al., 2017) with respect to the total communication overhead. FedAvg uses a learning rate decay of 0.99 per communication round for MNIST and 0.996 for CIFAR-10. The number of local iterations is tuned over the set {1, 5, 10, 20}, and the results with the best final testing accuracy are presented.
  • Figure 2: The left and right figures show the testing accuracy of Sto-SIGNSGD for different numbers of Byzantine workers and different $\boldsymbol{b}$ on MNIST and CIFAR-10, respectively. For MNIST, the Byzantine workers evaluate their gradients over the whole training dataset. For CIFAR-10, each Byzantine worker has 2,000 training examples that are sampled from the training dataset uniformly at random. The mini-batch sizes of all the workers and the Byzantine attackers are set to 32.
  • Figure 3: The left figure shows the testing accuracy of Sto-SIGNSGD with weighted vote on MNIST with $\boldsymbol{b}=0.03\cdot\boldsymbol{1}$ and different numbers of Byzantine workers that evaluate their gradients over the whole training dataset. The right figure shows the testing accuracy of Sto-SIGNSGD with weighted vote on CIFAR-10 with $\boldsymbol{b}=0.012\cdot\boldsymbol{1}$. Each Byzantine worker has 2,000 training examples that are sampled from the training dataset uniformly at random. The mini-batch sizes of all the workers and the Byzantine attackers are set to 32.
  • Figure 4: Performance of DP-TopSIGNSGD, DP-SIGNSGD and DP-FedSGD on MNIST. We follow the gradient clipping idea of Abadi et al. (2016) to bound the sensitivity $\Delta_{2}$. After computing the gradient for each individual training sample in the local dataset, each worker clips it so that its $L_{2}$ norm is at most a clipping threshold $C$, which ensures that $\Delta_{2}\leq C$. We set $C=4$ in the experiments.
  • Figure 5: The first figure shows the performance of DP-SIGNSGD and EF-DP-SIGNSGD on MNIST for different $\epsilon$ when $\delta = 10^{-5}$, without Byzantine attackers. The values of $\epsilon$ measure the per-epoch privacy guarantee of the algorithms. The second figure compares Sto-SIGNSGD with EF-Sto-SIGNSGD on MNIST with $\boldsymbol{b}=0.02\cdot\boldsymbol{1}$. The last figure compares Sto-SIGNSGD with EF-Sto-SIGNSGD on CIFAR-10 with optimal $\boldsymbol{b}$.
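
The per-sample clipping step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `clip_per_sample` and the array layout (one row per training sample) are assumptions, and the scaling rule is the standard one that leaves a gradient unchanged when its $L_{2}$ norm is already at most $C$.

```python
import numpy as np

def clip_per_sample(per_sample_grads, C):
    """Rescale each row so that its L2 norm is at most C.

    per_sample_grads: array of shape (num_samples, dim), one gradient per row
    (hypothetical layout). Rows with norm <= C are returned unchanged.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    # Guard against division by zero for all-zero gradients.
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return g * scale

# Example with the threshold C = 4 quoted in the caption.
grads = np.array([[3.0, 4.0],    # norm 5, gets scaled down
                  [0.3, 0.4],    # norm 0.5, left unchanged
                  [0.0, 0.0]])   # zero gradient, left unchanged
clipped = clip_per_sample(grads, C=4.0)
```

After clipping, every per-sample gradient has norm at most $C$, which is the property the caption relies on to bound $\Delta_{2}$.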

Theorems & Definitions (56)

  • Theorem 1
  • Remark 1
  • Definition 1
  • Corollary 1
  • Remark 2
  • Theorem 2
  • Remark 3
  • Theorem 3
  • Theorem 4
  • Remark 4
  • ...and 46 more