FedBCGD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

Junkang Liu; Fanhua Shang; Yuanyuan Liu; Hongying Liu; Yuangang Li; YunXiang Gong

TL;DR

This paper proposes a novel Federated Block Coordinate Gradient Descent method for communication efficiency and provides a convergence analysis for the proposed algorithms. To the best of the authors' knowledge, it is the first work on parameter block communication for training large-scale deep models.

Abstract

Although Federated Learning has been widely studied in recent years, each communication round still incurs high overhead for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method. The proposed method splits the model parameters into several blocks, including a shared block, and has each client upload only a specific parameter block, which can significantly reduce communication overhead. Moreover, we develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide a convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and that they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms over state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.
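The block-upload idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the block names (`shared`, `block_0`, `block_1`), the helper functions, and the use of a plain coordinate-wise mean for aggregation are all assumptions made for illustration. It shows only how a server might combine uploads when each client sends the shared block plus one assigned block.

```python
from statistics import fmean


def aggregate_blocks(global_model, uploads):
    """Average each parameter block over the clients that uploaded it.

    global_model: dict mapping a block name to a list of floats.
    uploads: list of dicts, one per client, holding only the blocks
             that client sent (hypothetical wire format).
    Plain averaging is an assumption, not the paper's exact rule.
    """
    new_model = {}
    for name, old_values in global_model.items():
        received = [u[name] for u in uploads if name in u]
        if not received:
            # Nobody sent this block, so keep the previous global value.
            new_model[name] = list(old_values)
            continue
        # Coordinate-wise mean over the clients that sent this block.
        new_model[name] = [fmean(column) for column in zip(*received)]
    return new_model


def uploaded_fraction(global_model, upload):
    """Fraction of all model parameters contained in one client's upload."""
    total = sum(len(values) for values in global_model.values())
    sent = sum(len(values) for values in upload.values())
    return sent / total


# Toy model: a 2-parameter shared block and N = 2 upload blocks of 4 each.
global_model = {"shared": [0.0, 0.0], "block_0": [0.0] * 4, "block_1": [0.0] * 4}

# Each client trains locally, then uploads the shared block and its own block.
uploads = [
    {"shared": [1.0, 1.0], "block_0": [2.0] * 4},
    {"shared": [3.0, 3.0], "block_1": [4.0] * 4},
]

new_model = aggregate_blocks(global_model, uploads)
print(new_model["shared"])                          # [2.0, 2.0]
print(uploaded_fraction(global_model, uploads[0]))  # 0.6
```

In this toy setting each client uploads 6 of 10 parameters. The saving approaches the $1/N$ factor only when the shared block is small relative to the upload blocks.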

Paper Structure (35 sections, 22 theorems, 134 equations, 9 figures, 11 tables, 3 algorithms)

Introduction
Related Work
Communication-Efficient Block Coordinate Gradient Descent FL
The proposed FedBCGD Algorithm
Our FedBCGD+ Algorithm
Theoretical Guarantees
Theoretical Results of FedBCGD
Theoretical Results of FedBCGD+
Experiments
Experimental Settings and Baselines
Results on Non-Convex Problems
Results on Convex Problems
Conclusion
Appendix A: Basic Assumptions and Notations
Basic Assumptions
...and 20 more sections

Key Result

Theorem 1

For $\beta$-smooth functions $\left\{f_i\right\}$ that satisfy Assumptions 1-5 (see the Appendix for details), the output of FedBCGD has expected error smaller than $\epsilon$ for some values of $\eta, R$, where $R$ denotes the number of communication rounds and $Com$ is the communication complexity. Non-convex: $\tilde{\eta}=\frac{1}{4} \alpha \eta T$, $\tilde{\eta} \leq \frac{1}{16 \beta}$, $F:= \dots$

Figures (9)

Figure 1: The diagram of the proposed FedBCGD framework, where $S\geq N$, $S$ and $N$ are the numbers of clients and parameter blocks, respectively.
Figure 2: The client parameter block allocation in FedBCGD. For the sake of convenience, we suppose $S=N\cdot K$ clients are sampled and divided into $N$ client blocks, i.e., $K$ clients for each client block. The clients in the $i$-th client block are responsible for optimizing the upload parameter block $i$.
Figure 3: Convergence comparison of our FedBCGD and FedBCGD+ against other baselines on the CIFAR10 and CIFAR100 datasets with different neural network architectures, under partial participation (10% of 100 clients) and $\rho\!=\!0.6$.
Figure 4: The acceleration comparison of FedBCGD with different numbers of blocks.
Figure 5: Accuracy comparison of FedBCGD with LeNet-5 on CIFAR10 (a) and CIFAR100 (b), where the heterogeneity is $\rho\!=\!0.6$. FedBCGD_freezing_nonshare is updated using the local freezing parameter algorithm without the shared block. FedBCGD_freezing_share is the FedBCGD_freezing algorithm with shared parameters. FedBCGD_nonshare trains all parameters locally and transmits only parameter blocks, without shared parameters. FedBCGD_share has shared parameters. FedBCGD_share_momentum (i.e., FedBCGD) has momentum acceleration.
...and 4 more figures
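
The client allocation described in the Figure 2 caption ($S=N\cdot K$ sampled clients split into $N$ client blocks of $K$ clients each) can be sketched as follows. The function name `assign_client_blocks` and the choice of contiguous grouping are assumptions for illustration; the caption does not say how clients are ordered or grouped.

```python
def assign_client_blocks(sampled_clients, num_blocks):
    """Split S = N * K sampled clients into N client blocks of K clients.

    Returns a dict mapping block index i to the clients responsible for
    optimizing upload parameter block i. Contiguous grouping is an
    assumption made for this sketch.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if len(sampled_clients) % num_blocks != 0:
        raise ValueError("number of sampled clients must be a multiple of num_blocks")
    k = len(sampled_clients) // num_blocks  # K clients per client block
    return {
        i: sampled_clients[i * k:(i + 1) * k]
        for i in range(num_blocks)
    }


# S = 6 sampled clients, N = 3 parameter blocks, so K = 2 clients per block.
allocation = assign_client_blocks(list(range(6)), 3)
print(allocation)  # {0: [0, 1], 1: [2, 3], 2: [4, 5]}
```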

Theorems & Definitions (22)

Theorem 1: FedBCGD
Theorem 2: FedBCGD+
Theorem 1: Convergence rates of FedBCGD
Theorem 2: Convergence rates of FedBCGD+
Lemma 1
Lemma 2: Bounding heterogeneity
Lemma 3
Lemma 4
Lemma 5: Bounded drift
Lemma 6
...and 12 more
