Improved Convergence in Parameter-Agnostic Error Feedback through Momentum

Abdurakhmon Sadiev, Yury Demidovich, Igor Sokolov, Grigory Malinovsky, Sarit Khirirat, Peter Richtárik

TL;DR

The paper tackles the challenge of distributed training with compressed communications by introducing EF21 methods that use normalization and parameter-agnostic stepsizes, enabling practical optimization without knowing problem-specific constants. Five momentum variants (Polyak, IGT, RHM, HM, MVR) are analyzed, yielding near-optimal convergence rates for nonconvex smooth objectives, with rates ranging from $\tilde{O}(1/T^{1/4})$ to $\tilde{O}(1/T^{1/3})$. Theoretical results are complemented by experiments on CIFAR-10 with ResNet-18, showing improved sample efficiency and favorable wall-clock performance, particularly for IGT and HM variants. The findings highlight that normalization plus momentum can closely match tuned, problem-dependent EF21 performance while reducing the practical burden of hyperparameter estimation in large-scale training. The work advances scalable distributed optimization under biased compression, with implications for federated and data-heterogeneous settings.

Abstract

Communication compression is essential for scalable distributed training of modern machine learning models, but it often degrades convergence due to the noise it introduces. Error Feedback (EF) mechanisms are widely adopted to mitigate this issue in distributed compression algorithms. Despite their popularity and training efficiency, existing distributed EF algorithms often require prior knowledge of problem parameters (e.g., smoothness constants) to fine-tune stepsizes. This limits their practical applicability, especially in large-scale neural network training. In this paper, we study normalized error feedback algorithms that combine EF with normalized updates, various momentum variants, and parameter-agnostic, time-varying stepsizes, thus eliminating the need for problem-dependent tuning. We analyze the convergence of these algorithms for minimizing smooth functions and establish parameter-agnostic complexity bounds that are close to the best-known bounds obtained with carefully tuned, problem-dependent stepsizes. Specifically, we show that normalized EF21 achieves a convergence rate of nearly $O(1/T^{1/4})$ with Polyak's heavy-ball momentum, $O(1/T^{2/7})$ with Iterative Gradient Transport (IGT), and $O(1/T^{1/3})$ with STORM and Hessian-corrected momentum. Our results hold with decreasing stepsizes and small mini-batches. Finally, our empirical experiments confirm our theoretical insights.
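To make the combination described in the abstract concrete, the following is a minimal single-process sketch of error feedback with a Top-k compressor, Polyak-style momentum, and a normalized server step. It is an illustration under stated assumptions, not the paper's algorithm listing: the function names, the simulated workers, and in particular the stepsize and momentum schedules (`gamma0 / (t + 1) ** 0.75` and `1 / (t + 1) ** 0.5`) are hypothetical choices made only so the example runs.

```python
import numpy as np


def top_k(v, k):
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


def normalized_ef21_momentum(grad_fns, x0, T, k, gamma0=0.5, seed=0):
    """Simulate n workers running EF21-style error feedback with momentum.

    grad_fns[i](x, rng) returns a stochastic gradient of worker i's loss at x.
    The schedules below are illustrative assumptions, not the paper's choices.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    # Momentum buffers v_i and compressed local estimates g_i, one per worker.
    v = [gf(x, rng) for gf in grad_fns]
    g_local = [top_k(vi, k) for vi in v]
    g = np.mean(g_local, axis=0)  # server-side aggregate of the g_i

    for t in range(T):
        gamma_t = gamma0 / (t + 1) ** 0.75  # assumed decreasing stepsize
        eta_t = 1.0 / (t + 1) ** 0.5        # assumed momentum weight

        # Normalized step: only the direction of g is used, not its scale.
        norm = np.linalg.norm(g)
        if norm > 0.0:
            x = x - gamma_t * g / norm

        messages = []
        for i, gf in enumerate(grad_fns):
            # Polyak-style momentum on the local stochastic gradient.
            v[i] = (1.0 - eta_t) * v[i] + eta_t * gf(x, rng)
            # Error feedback: compress the gap between v_i and the estimate g_i.
            c = top_k(v[i] - g_local[i], k)
            g_local[i] = g_local[i] + c
            messages.append(c)
        # The server only ever receives the compressed corrections.
        g = g + np.mean(messages, axis=0)
    return x
```

As a usage example, one could pass quadratic workers such as `lambda x, rng, b=b: x - b` for a few vectors `b` and check that the returned point moves toward their average. Because the step is normalized, no smoothness constant enters the stepsize, which is the practical point the abstract emphasizes.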

Paper Structure

This paper contains 78 sections, 9 theorems, 183 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Consider Problem (eqn:problem), where Assumptions assum:contract_comp, assum:inf, assum:L_comp_fcn, assum:L_fcn, and assum:stoc_g_h hold. Let tuning parameters satisfy with $\gamma_0 >0$. Then, the iterates $\{x^t\}$ governed by EF21-SGDM satisfy where $\tilde{x}^T$ is randomly chosen from $\{x^0,x^1,\ldots,x^{T-1}\}$ with probability $\gamma_t/\sum_{t=0}^{T-1}\gamma_t$ for $t=0,1,\ldots,T-1$.
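The output rule in Theorem 1 picks $\tilde{x}^T$ among the stored iterates with probability proportional to the stepsize used at that iteration. A minimal sketch of that sampling step is given below; the helper names are hypothetical, and the stepsizes are taken as an input because the schedule itself is not reproduced above.

```python
import numpy as np


def output_probabilities(stepsizes):
    """Return p_t = gamma_t / sum_s gamma_s for t = 0, ..., T-1."""
    gammas = np.asarray(stepsizes, dtype=float)
    return gammas / gammas.sum()


def sample_output_iterate(iterates, stepsizes, rng):
    """Draw one iterate x^t with probability proportional to gamma_t."""
    probs = output_probabilities(stepsizes)
    t = int(rng.choice(len(iterates), p=probs))
    return t, iterates[t]
```

With a decreasing schedule this weighting favors earlier iterates, since they were taken with larger stepsizes; with a constant schedule it reduces to uniform sampling.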

Figures (4)

  • Figure 1: Performance comparison of all methods on CIFAR-10 with ResNet-18, plotted as a function of epochs. The proposed momentum variants, particularly $\left\| \text{EF21-HM} \right\|$ and $\left\| \text{EF21-IGT} \right\|$, show superior sample efficiency.
  • Figure 2: Performance comparison of all methods on CIFAR-10 with ResNet-18, plotted as a function of epochs. The proposed momentum variants, particularly $\left\| \text{EF21-HM} \right\|$ and $\left\| \text{EF21-IGT} \right\|$, show superior sample efficiency.
  • Figure 3: Performance comparison as a function of cumulative wall-clock seconds over the full training duration. Methods with higher per-epoch costs take longer to complete the 90-epoch training schedule.
  • Figure 4: Time-to-solution performance comparison, with the timeline truncated to the completion time of the fastest methods. This view highlights that $\left\| \text{EF21-IGT} \right\|$ achieves a convergence speed and accuracy comparable to the much more costly $\left\| \text{EF21-HM} \right\|$.

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 1
  • Lemma 2: Descent Lemma
  • proof
  • Lemma 3
  • proof
  • ...and 1 more