Parallel Momentum Methods Under Biased Gradient Estimations

Ali Beikmohammadi, Sarit Khirirat, Sindri Magnússon

TL;DR

The paper studies parallel momentum methods for distributed optimization when gradient estimates are biased. It develops a unified worst-case convergence framework for general non-convex and $\mu$-PL objectives that does not require unbiased gradient estimates, and it applies the results to biased gradient models including compression, clipping, and stochastic composite gradients (e.g., MAML). Using descent-based bounds under an affine-variance noise assumption, the authors establish sublinear convergence for general non-convex objectives and linear convergence for $\mu$-PL objectives, each up to a bias-dependent residual, and demonstrate improved performance over biased SGD in distributed neural network experiments. The work provides practical guidance for deploying momentum in server–worker setups with communication-efficient gradients, and it highlights robustness and faster convergence across diverse bias scenarios. Overall, the framework broadens the applicability of momentum methods in real-world distributed learning where unbiased gradients are not available.

Abstract

Parallel stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. The gradient estimations easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish worst-case bounds on parallel momentum methods under biased gradient estimation on both general non-convex and $\mu$-PL problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, i.e. in meta-learning and when the gradients are compressed or clipped. Our numerical experiments verify our theoretical findings and show faster convergence performance of momentum methods than traditional biased gradient descent.
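To make the setting concrete, here is a minimal sketch of a server–worker momentum loop in which every worker clips its stochastic gradient before sending it, so the averaged estimate is biased. The exponential-moving-average form of the momentum update, the quadratic worker objectives, and all names and parameter values (`clip`, `parallel_momentum`, `make_worker`, `tau`, the centers) are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def clip(g, tau):
    """Rescale g so its Euclidean norm is at most tau (a biased operator)."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [scale * gi for gi in g]


def parallel_momentum(worker_grads, x0, gamma=0.1, beta=0.1, tau=1.0, steps=500):
    """Momentum with clipped worker gradients (assumed update, for illustration):

        g^k     = (1/n) * sum_i clip(grad_i(x^k), tau)
        v^k     = (1 - beta) * v^{k-1} + beta * g^k
        x^{k+1} = x^k - gamma * v^k
    """
    n = len(worker_grads)
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        # Each worker sends a clipped stochastic gradient; the server averages them.
        g = [0.0] * len(x)
        for grad_i in worker_grads:
            c = clip(grad_i(x), tau)
            g = [gj + cj / n for gj, cj in zip(g, c)]
        # Server-side momentum step.
        v = [(1 - beta) * vj + beta * gj for vj, gj in zip(v, g)]
        x = [xj - gamma * vj for xj, vj in zip(x, v)]
    return x


def make_worker(center, noise_std, rng):
    """Hypothetical worker objective f_i(x) = 0.5 * ||x - center||^2 with noisy gradients."""
    def grad(x):
        return [xj - cj + rng.gauss(0.0, noise_std) for xj, cj in zip(x, center)]
    return grad


rng = random.Random(0)
centers = [[1.0, -1.0], [3.0, 1.0], [2.0, 0.0]]  # average objective is minimized at (2, 0)
workers = [make_worker(c, 0.01, rng) for c in centers]
x_final = parallel_momentum(workers, x0=[5.0, 5.0])
print(x_final)
```

In this toy problem the clipped gradients happen to cancel at the true minimizer by symmetry, so the iterate ends up close to (2, 0). In general, clipping shifts the point the method converges to, which is the kind of bias-dependent residual the analysis accounts for.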

Paper Structure

This paper contains 33 sections, 9 theorems, 54 equations, and 2 figures.

Key Result

Lemma 1

Consider the momentum methods in Eqs. (eqn:momentum_equivalent_x_k) and (eqn:momentum_equivalent_v_k) for Problem (eqn:Problem), where Assumption (assum:smooth_bounded) holds. Let $\phi^k = f(x^k) - f^\star + A \| \nabla f(x^k) - v^{k-1}\|^2$ for $A>0$, and let $\eta^{k}$ be as in (eqn:generic_eta_k). Then, … where $B_1 = \gamma \frac{1-\beta}{2} + A\left(1-\frac{\beta}{2}\right)$, $B_2 = \frac{1}{2\gamma} - \frac{L}{2} \dots$
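As a small aid for reading the lemma, the sketch below evaluates the two quantities that are fully stated above: the Lyapunov function $\phi^k$ and the constant $B_1$. The function names and the example numbers are hypothetical. $B_2$ and the inequality itself are not computed, because they are not fully given in the text above.

```python
def lyapunov(f_gap, grad, v_prev, A):
    """phi^k = (f(x^k) - f*) + A * ||grad f(x^k) - v^{k-1}||^2, with A > 0.

    f_gap  : the suboptimality f(x^k) - f*
    grad   : the true gradient at x^k (list of floats)
    v_prev : the momentum buffer v^{k-1} (list of floats)
    """
    if A <= 0:
        raise ValueError("A must be positive")
    err_sq = sum((g - v) ** 2 for g, v in zip(grad, v_prev))
    return f_gap + A * err_sq


def b1(gamma, beta, A):
    """B_1 = gamma * (1 - beta) / 2 + A * (1 - beta / 2), as stated in the lemma."""
    return gamma * (1 - beta) / 2 + A * (1 - beta / 2)


# Hypothetical numbers, chosen only to show the arithmetic.
phi = lyapunov(f_gap=1.0, grad=[1.0, 2.0], v_prev=[0.0, 0.0], A=0.5)
print(phi)                                # 1.0 + 0.5 * (1 + 4) = 3.5
print(b1(gamma=0.5, beta=0.1, A=1.0))     # 0.225 + 0.95, about 1.175
```

The second term of $\phi^k$ measures how far the momentum buffer is from the true gradient, so $\phi^k$ shrinks only when both the objective gap and the tracking error shrink.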

Figures (2)

  • Figure 1: Performance of parallel SGDM (i.e., the momentum method) and SGD under various biased gradient estimations, in terms of training loss (left plots) and test accuracy (right plots) on the MNIST (top plots) and FashionMNIST (bottom plots) datasets, with $n=100$ and $\gamma=0.5$. Panels (a) and (c) use compressed gradients; panels (b) and (d) use clipped gradients.
  • Figure 2: Effect of the parameters $\sigma^2$, $\delta$, and $K$, which control the noise, bias, and compression level, respectively, for Top-$K$ sparsification. Here we optimize $f(x)= \frac{1}{2}\|Ax\|^2$, $x \in \mathbb{R}^{10}$, with $\gamma = 0.5$ and $\beta = 0.1$.
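The Top-$K$ sparsifier mentioned in the Figure 2 caption can be sketched as follows. The compressor keeps the $K$ largest-magnitude coordinates and zeroes the rest. The diagonal matrix `a_diag` and the point `x` are hypothetical stand-ins for the $A$ and $x \in \mathbb{R}^{10}$ of the caption, whose actual values are not given here. The loop also checks the standard bound $\|\mathrm{Top}_K(g) - g\|^2 \le (1 - K/d)\,\|g\|^2$.

```python
def top_k(g, k):
    """Keep the k largest-magnitude entries of g and set the others to zero."""
    order = sorted(range(len(g)), key=lambda i: abs(g[i]), reverse=True)
    keep = set(order[:k])
    return [gi if i in keep else 0.0 for i, gi in enumerate(g)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)


d = 10
a_diag = [0.1 * (i + 1) for i in range(d)]        # assumed diagonal A (illustration only)
x = [1.0] * d                                     # assumed evaluation point
grad = [a * a * xi for a, xi in zip(a_diag, x)]   # gradient of 0.5 * ||A x||^2 is A^T A x

for k in (1, 3, 10):
    compressed = top_k(grad, k)
    err = sq_norm([c - g for c, g in zip(compressed, grad)])
    # Relative compression error and the (1 - k/d) bound it must respect.
    print(k, err / sq_norm(grad), 1 - k / d)
```

A smaller $K$ means less communication per round but a larger compression error, which is the trade-off the $K$ sweep in Figure 2 explores.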

Theorems & Definitions (19)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Remark 1
  • Remark 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • ...and 9 more