Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko

TL;DR

This work advances distributed optimization with contractive compression by replacing EF21's dependence on the quadratic mean of the smoothness constants, $L_{\rm QM}$, with their arithmetic mean, $L_{\rm AM}$, which yields faster convergence in heterogeneous settings. The authors present the result as a path of discovery: from a cloning reformulation of the problem to EF21-W, a weighted variant whose weights are tied to the clients' smoothness constants and which requires no cloning, and finally to a weighted analysis that establishes the improved rate for the original EF21. The approach extends to EF21-SGD, EF21-PP, and the rare-features regime, with experiments showing practical gains on nonconvex problems and federated-like settings. Overall, the paper combines theory and empirical evidence that an analysis based on arithmetic-mean smoothness strengthens EF21's guarantees under data heterogeneity.

Abstract

Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theoretical understanding of this mechanism, there is still a lot to explore. In this work we study a modern form of error feedback called EF21 (Richtarik et al., 2021) which offers the currently best-known theoretical guarantees, under the weakest assumptions, and also works well in practice. In particular, while the theoretical communication complexity of EF21 depends on the quadratic mean of certain smoothness parameters, we improve this dependence to their arithmetic mean, which is always smaller, and can be substantially smaller, especially in heterogeneous data regimes. We take the reader on a journey of our discovery process. Starting with the idea of applying EF21 to an equivalent reformulation of the underlying problem which (unfortunately) requires (often impractical) machine cloning, we continue to the discovery of a new weighted version of EF21 which can (fortunately) be executed without any cloning, and finally circle back to an improved analysis of the original EF21 method. While this development applies to the simplest form of EF21, our approach naturally extends to more elaborate variants involving stochastic gradients and partial participation. Further, our technique improves the best-known theory of EF21 in the rare features regime (Richtarik et al., 2023). Finally, we validate our theoretical findings with suitable experiments.
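
The claim that the arithmetic mean is "always smaller" can be checked in one line. The following is a short worked derivation using the definitions of $L_{\rm AM}$ and $L_{\rm QM}$ that appear in the caption of Figure 1; the numerical values $L = (1, 1, 1, 97)$ are a hypothetical illustration, not data from the paper.

```latex
\[
L_{\rm AM} = \frac{1}{n}\sum_{i=1}^n L_i,
\qquad
L_{\rm QM} = \sqrt{\frac{1}{n}\sum_{i=1}^n L_i^2},
\qquad
L_{\rm QM}^2 - L_{\rm AM}^2
  = \frac{1}{n}\sum_{i=1}^n \left(L_i - L_{\rm AM}\right)^2 \;\geq\; 0 .
\]
% Hypothetical example with n = 4 and L = (1, 1, 1, 97):
\[
L_{\rm AM} = \frac{100}{4} = 25,
\qquad
L_{\rm QM} = \sqrt{\frac{3 + 97^2}{4}} = \sqrt{2353} \approx 48.5 .
\]
```

The gap is the variance of the constants $L_i$, so the two means coincide only when all clients share the same smoothness constant, and the gap grows with heterogeneity.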

Paper Structure

This paper contains 61 sections, 28 theorems, 201 equations, 14 figures, 6 tables, 6 algorithms.

Key Result

Theorem 2

Consider Algorithm alg:EF21 (EF21) applied to the "cloning reformulation" eq:cloning of the distributed optimization problem eq:main_problem, where $N^\star_i = \left\lceil L_i/{\color{blue}L_{\rm AM}} \right\rceil$ for all $i \in [n]$. Let Assumptions as:smooth--as:lower_bound hold, assume that … and let the stepsize satisfy $0 < \gamma \leq \frac{1}{L + \sqrt{2}{\color{blue}L_{\rm AM}} \xi (\a…
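
The clone counts $N^\star_i = \lceil L_i / L_{\rm AM} \rceil$ from the statement above are easy to compute. The sketch below is only an illustration: the helper name `clone_counts` and the smoothness constants are hypothetical. It also checks a simple observation that follows from $\lceil x \rceil < x + 1$, namely that the total number of clones stays below $2n$.

```python
import math

def clone_counts(smoothness):
    """Return (L_AM, [N_i]) where N_i = ceil(L_i / L_AM)."""
    l_am = sum(smoothness) / len(smoothness)
    counts = [math.ceil(l_i / l_am) for l_i in smoothness]
    return l_am, counts

# Hypothetical, highly heterogeneous smoothness constants (not from the paper).
L = [1.0, 1.0, 1.0, 97.0]
l_am, counts = clone_counts(L)
total = sum(counts)

print("L_AM          :", l_am)
print("clone counts  :", counts)
print("total machines:", total, "vs. bound 2n =", 2 * len(L))
```

In this example only the client with the large constant is replicated, which is why the paper calls cloning "often impractical" yet uses it as a stepping stone toward the weighted method.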

Figures (14)

  • Figure 1: Comparison of EF21 versus our new EF21-W with the Top1 compressor on the non-convex logistic regression problem. The number of clients $n$ is $1,000$. The step size for EF21 is set according to the original EF21 theory (Richtarik et al., 2021), and the step size for EF21-W is set according to Theorem 3. The coefficient $\lambda$ for (b)--(f) is set to $0.001$, and for (a) is set to $1,000$ for numerical stability. We let $L_{\rm var} \coloneqq {\color{red}L_{\rm QM}^2} - {\color{blue}L_{\rm AM}^2} = {\color{red}\frac{1}{n}\sum_{i=1}^n L_i^2} - {\color{blue}\left(\frac{1}{n}\sum_{i=1}^n L_i \right)^2}$.
  • Figure 2: Comparison of EF21-W with partial participation (EF21-W-PP) or stochastic gradients (EF21-W-SGD) versus EF21 with partial participation (EF21-PP) or stochastic gradients (EF21-SGD). The Top1 compressor was employed in all experiments. The number of clients is $n=1,000$. All stepsizes are theoretical. The coefficient $\lambda$ was set to $0.001$ for (a), (b) and to $1,000$ for (c), (d).
  • Figure 3: Comparison of EF21 vs. EF21-W with the Top1 compressor on the non-convex linear problem. The number of clients $n$ is $2,000$. The coefficient $\lambda$ has been set to $100$. The step size for EF21 is set according to the original EF21 theory (Richtarik et al., 2021), and the step size for EF21-W is set according to Theorem 3. In all cases, the smoothness constant $L$ equals $50$.
  • Figure 4: The factor $\xi=\sqrt{{\beta}/{\theta}}$ as a function of the dimension $d$ of the optimization variable for several TopK compressors. The behavior is independent of the properties of $\{f_1(x),\dots,f_n(x)\}$ and $f(x)$.
  • Figure 5: Convex smooth optimization. EF21 and EF21-W with Top1 client compressor, $n=2\,000$, $d=10$. The objective function consists of the functions $f_i(x)$ defined in Eq. \ref{eq:linreg-cvx}, with regularization term $\lambda \frac{\|x\|^2}{2}$, where $\lambda=0.01$. Theoretical step size. Full participation. Extra details are in Table \ref{tbl:app-syn-ef21-cvx}.
  • ...and 9 more figures
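
The factor $\xi=\sqrt{\beta/\theta}$ from Figure 4 can be tabulated with a few lines of code. The summary does not define $\theta$ and $\beta$, so this sketch assumes the constants used in the original EF21 analysis, $\theta = 1 - \sqrt{1-\alpha}$ and $\beta = (1-\alpha)/\theta$, with $\alpha = k/d$ for TopK. Treat those formulas and the helper name `xi_topk` as assumptions rather than as the paper's exact definitions.

```python
import math

def xi_topk(k, d):
    """xi = sqrt(beta / theta) for a Top-k compressor in dimension d.

    Assumed (not stated in this summary): alpha = k / d,
    theta = 1 - sqrt(1 - alpha), beta = (1 - alpha) / theta.
    """
    alpha = k / d
    root = math.sqrt(1.0 - alpha)
    theta = 1.0 - root
    beta = (1.0 - alpha) / theta
    return math.sqrt(beta / theta)

# xi depends only on k and d, not on the functions f_i being optimized.
for d in (10, 100, 1000):
    print(d, [round(xi_topk(k, d), 2) for k in (1, 5, 10)])
```

Under these assumptions $\xi$ grows roughly like $2d/k$ for small $k/d$, which is consistent with the caption's remark that the factor depends only on the compressor and the dimension.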

Theorems & Definitions (50)

  • Definition 1: Compressors
  • Example 1
  • Theorem 2: Convergence of EF21 applied to problem \ref{eq:cloning} with $N^\star$ machines
  • Theorem 3: Theory for EF21-W
  • Theorem 4: New theory for EF21
  • Lemma 1: Optimal weights
  • Lemma 2: $\sqrt{2}$-approximation
  • proof
  • Lemma 3: PAGE2021
  • Lemma 4: Young's inequality
  • ...and 40 more
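
To make the weighted method behind Theorem 3 and Lemma 1 more concrete, here is a minimal toy sketch of an EF21-W style iteration with a Top-k compressor. Several ingredients are assumptions that this summary does not spell out: the weights $w_i = L_i / \sum_j L_j$, the local estimators tracking $\nabla f_i / (n w_i)$, the server aggregate $g = \sum_i w_i g_i$, and the stepsize $\gamma = 1/(L + L_{\rm AM}\,\xi)$. The quadratic test problem $f_i(x) = \frac{L_i}{2}\|x - b_i\|^2$ and all function names are hypothetical.

```python
import math

def top_k(v, k):
    """Top-k compressor: keep the k largest-magnitude entries, zero the rest."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    return [v[j] if j in keep else 0.0 for j in range(len(v))]

def grad_norm(L, b, x):
    """Norm of grad f(x) for f(x) = (1/n) * sum_i (L_i / 2) * ||x - b_i||^2."""
    n, d = len(L), len(x)
    g = [sum(L[i] * (x[j] - b[i][j]) for i in range(n)) / n for j in range(d)]
    return math.sqrt(sum(c * c for c in g))

def ef21_w(L, b, k=1, steps=1000):
    """Toy EF21-W style loop (a sketch under the assumptions named above)."""
    n, d = len(L), len(b[0])
    l_am = sum(L) / n
    w = [l_i / (n * l_am) for l_i in L]      # assumed weights w_i = L_i / sum_j L_j
    root = math.sqrt(1.0 - k / d)
    xi = root / (1.0 - root)                 # assumed sqrt(beta / theta) for Top-k
    gamma = 1.0 / (l_am + l_am * xi)         # for this toy problem L equals L_AM

    def scaled_grad(i, x):
        # Gradient of f_i rescaled by 1 / (n * w_i).
        return [L[i] * (x[j] - b[i][j]) / (n * w[i]) for j in range(d)]

    x = [0.0] * d
    g_loc = [scaled_grad(i, x) for i in range(n)]   # exact initial estimators
    for _ in range(steps):
        # Server: weighted aggregate of local estimators, then a gradient-type step.
        g = [sum(w[i] * g_loc[i][j] for i in range(n)) for j in range(d)]
        x = [x[j] - gamma * g[j] for j in range(d)]
        # Clients: send only the compressed change of their estimator.
        for i in range(n):
            target = scaled_grad(i, x)
            msg = top_k([target[j] - g_loc[i][j] for j in range(d)], k)
            g_loc[i] = [g_loc[i][j] + msg[j] for j in range(d)]
    return x

# Hypothetical heterogeneous instance: one client is far less smooth than the rest.
L = [1.0, 1.0, 1.0, 97.0]
b = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0], [1.0, -1.0, 0.5]]
x_final = ef21_w(L, b)
print("initial gradient norm:", grad_norm(L, b, [0.0, 0.0, 0.0]))
print("final gradient norm  :", grad_norm(L, b, x_final))
```

With these weights every rescaled local function $f_i/(n w_i)$ has the same smoothness constant $L_{\rm AM}$, which is the intuition for why the stepsize can depend on $L_{\rm AM}$ rather than $L_{\rm QM}$. The toy problem is deliberately simple and says nothing about the nonconvex experiments reported in the figures.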