Sparse Training for Federated Learning with Regularized Error Correction

Ran Greidi; Kobi Cohen

TL;DR

This paper tackles the communication bottleneck in federated learning by pushing sparsity through Top-$K$ updates while counteracting the staleness induced by error accumulation. It introduces FLARE, a sparse training framework that uses accumulated residuals and a regularized embedding loss to guide in-round optimization, enabling sparsity levels up to $R=0.001\%$ with accuracy gains over prior methods. The authors provide a convergence analysis showing FLARE matches $O(1/\sqrt{T})$ rates and exhibits improved scaling with sparsity, plus extensive experiments on MNIST, CIFAR-10, and Shakespeare text that demonstrate substantial performance gains and robustness. An open-source implementation is released, underscoring FLARE's potential for practical, bandwidth-constrained FL deployments.

Abstract

Federated Learning (FL) has attracted much interest due to the significant advantages it brings to training deep neural network (DNN) models. However, since communication and computation resources are limited, training DNN models in FL systems faces challenges such as elevated computational and communication costs in complex tasks. Sparse training schemes are gaining increasing attention as a way to scale down the dimensionality of each client (i.e., node) transmission. Specifically, sparsification with error correction is a promising technique, where only important updates are sent to the parameter server (PS) and the rest are accumulated locally. While error correction methods have been shown to achieve a significant sparsification level of the client-to-PS message without harming convergence, pushing sparsity further remains unresolved due to the staleness effect. In this paper, we propose a novel algorithm, dubbed Federated Learning with Accumulated Regularized Embeddings (FLARE), to overcome this challenge. FLARE presents a novel sparse training approach via accumulated pulling of the updated models with regularization on the embeddings in the FL process, providing a powerful solution to the staleness effect and pushing sparsity to an exceptional level. The performance of FLARE is validated through extensive experiments on diverse and complex models, achieving a remarkable sparsity level (10 times and more beyond the current state-of-the-art) along with significantly improved accuracy. Additionally, an open-source software package has been developed for the benefit of researchers and developers in related fields.
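
The error-correction baseline described in the abstract (send only the important updates, accumulate the rest locally) can be illustrated with a minimal sketch. This is not the authors' released package: the function names `top_k_sparsify` and `client_step` are hypothetical, "important" is assumed to mean largest magnitude (Top-$K$), and plain Python lists stand in for model tensors.

```python
def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    k = max(0, min(k, len(vec)))
    # Indices sorted by decreasing magnitude; the first k are transmitted.
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def client_step(delta, residual, k):
    """One error-corrected round on a client (illustrative assumption).

    The residual carried over from earlier rounds is added to the fresh
    update, the Top-K entries are sent to the PS, and everything that was
    not sent is kept locally as the new residual.
    """
    corrected = [d + r for d, r in zip(delta, residual)]
    message = top_k_sparsify(corrected, k)
    new_residual = [c - m for c, m in zip(corrected, message)]
    return message, new_residual


# Toy update with five coordinates; only k=2 of them are transmitted.
delta = [0.5, -2.0, 0.1, 1.5, -0.3]
message, residual = client_step(delta, [0.0] * len(delta), k=2)
print(message)   # sparse client-to-PS message
print(residual)  # entries held back locally for later rounds
```

Entries that stay in the residual for many rounds are computed against an old model by the time they are finally sent, which is the staleness effect the abstract refers to.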

Paper Structure (19 sections, 2 theorems, 44 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Main Results
System Model and Problem Statement
The Federated Learning with Accumulated Regularized Embeddings (FLARE) Algorithm
Introduction to FLARE
Algorithm Description
Further Insights and Observations of FLARE Algorithm
Convergence Analysis
Experiments
Experiment 1: FC and CNN Models on the MNIST Dataset
Experiment 2: VGG 11, 16, 19 Models on the CIFAR10 Dataset
Experiment 3: Text generation based on "The Complete Works of William Shakespeare"
Experiment 4: Client unavailability
Experiment 5: Imbalanced MNIST Dataset with the FC Model
Experiment 6: Evaluation with reduced sparsity
...and 4 more sections

Key Result

Theorem 1

Consider the case where Assumptions 1 and 3 hold, and the loss function $F$ is convex. Denote the model iterates of FLARE by $\{ \bar{w}_t\}_{t>0}$, with step size $\gamma$, and let $\bar{w}^{avg}_T=\frac{1}{T}\sum^{T}_{t=0}{\bar{w}_t}$. Denote the minimizer of $\tilde{F}$ by $w^{*}$ where

Figures (10)

Figure 1: An illustration of the FLARE algorithm in four stages (see the Algorithm Description section for details): First, the PS broadcasts a global model $\bar{w}_{0}$ to the clients (Stage 1). Subsequently, each client generates a new model, sends its Top-$K$ deltas to the PS, and accumulates the error locally (Stage 2). Next, the PS aggregates all received models and broadcasts a new global model to all clients (Stage 3). FLARE attempts to reduce the stale updates by minimizing the term $\tau\|\bar{w}-(\bar{w}_{1}+\bar{A}^i_1)\|_1$ held by each client (Stage 4). The clients redefine their loss according to Eq. (1).
Figure 2: An illustration of the function $p(\tau)$.
Figure 3: The test accuracy performance is compared for FC and CNN models on MNIST with $E=1$ and $B=\infty$. $1$-FLARE is implemented with a sparsity setting of $R=0.001\%$. FFL, EF21, FedProx and Error Correction are included in the comparison. The uncompressed FedAvg is presented as a benchmark for performance. Remarkably, $1$-FLARE outperforms all other methods even when FFL is configured with a sparsity of $R=0.01\%$.
Figure 4: The test accuracy performance is compared for the CNN model on digit MNIST with $R=0.001\%$, $p=1,2,4$, $\tau=0.05$, $c=1.1$, 10 clients with $B=\infty$ and $E=4,8,16,32$. $p$-FLARE is compared with uncompressed FedAvg (benchmark for performance), Error Correction, FFL, EF21 and FedProx methods. At the bottom of each panel, the $E$ and $\log_{10}(R)$ values are plotted to illustrate the sparsity and computation levels of $p$-FLARE compared to FFL. $p$-FLARE significantly outperforms all other methods in all cases.
Figure 5: The test accuracy performance is compared for the FC model on digit MNIST with $R=0.001\%$, $p=1,2,4$, $\tau=0.5$, $c=1.05$, 10 clients with $B=\infty$ and $E=4,8,16,32$. $p$-FLARE is compared with uncompressed FedAvg (benchmark for performance), Error Correction, FFL, EF21 and FedProx methods. At the bottom of each panel, the $E$ and $\log_{10}(R)$ values are plotted to illustrate the sparsity and computation levels of $p$-FLARE compared to FFL. $p$-FLARE outperforms all other methods in all cases in terms of convergence rate, and shows strong accuracy as well.
...and 5 more figures
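
The Stage 4 term in the Figure 1 caption can be made concrete with a small sketch. The full loss from the paper is not reproduced on this page, so what follows only assumes the penalty is added to each client's local loss as written in the caption, $\tau\|w-(\bar{w}+\bar{A}^i)\|_1$. The names `l1_pull_penalty` and `regularized_loss` and the toy quadratic `local_loss` are hypothetical.

```python
def l1_pull_penalty(w, w_global, accumulator, tau):
    """tau * || w - (w_global + accumulator) ||_1, following the Figure 1 caption."""
    return tau * sum(
        abs(wi - (gi + ai)) for wi, gi, ai in zip(w, w_global, accumulator)
    )


def regularized_loss(local_loss, w, w_global, accumulator, tau):
    """Local loss plus the L1 pull toward (global model + accumulated error).

    Assumption: the penalty is simply added to the client's loss; the
    paper's exact formulation may differ.
    """
    return local_loss(w) + l1_pull_penalty(w, w_global, accumulator, tau)


def local_loss(w):
    # Toy stand-in for a client's training loss (sum of squares).
    return sum(x * x for x in w)


w = [1.0, 2.0]            # client's current weights
w_global = [0.5, 2.0]     # latest broadcast global model
accumulator = [0.5, -1.0] # locally accumulated, not-yet-sent error
total = regularized_loss(local_loss, w, w_global, accumulator, tau=0.5)
print(total)
```

In this toy case the pull target is `w_global + accumulator = [1.0, 1.0]`, so only the second coordinate is penalized. A larger $\tau$ pulls the client model harder toward that target.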

Theorems & Definitions (2)

Theorem 1
Theorem 2
