Non-ergodic linear convergence property of the delayed gradient descent under the strongly convexity and the Polyak-Łojasiewicz condition

Hyung Jun Choi, Woocheol Choi, Jinmyoung Seok

TL;DR

The paper studies gradient descent with a fixed time delay $\tau$ and proves a non-ergodic linear convergence rate for $\mu$-strongly convex and $L$-smooth functions, improving on prior ergodic results. It introduces an auxiliary sequence and a careful inductive framework to derive explicit decay bounds, and shows that larger step sizes on the order of $1/(L\tau)$ are admissible. The authors extend the analysis to the Polyak-Łojasiewicz condition and to stochastic gradient descent with time-varying delay, providing comparable linear convergence guarantees under appropriate step-size choices. Numerical experiments on least-squares and logistic regression, plus a PL-satisfying example and a stochastic-delay SGD test, validate the theoretical findings and illustrate practical convergence under delays. This work offers rigorous convergence guarantees for delayed gradient methods relevant to asynchronous and distributed optimization.

Abstract

In this work, we establish a linear convergence estimate for gradient descent with delay $\tau\in\mathbb{N}$ when the cost function is $\mu$-strongly convex and $L$-smooth. This result improves upon the well-known estimates of Arjevani et al. \cite{ASS} and Stich-Karimireddy \cite{SK} in the sense that it is non-ergodic and still holds under a weaker constraint on the cost function. Also, the range of the learning rate $\eta$ can be extended from $\eta\leq 1/(10L\tau)$ to $\eta\leq 1/(4L\tau)$ for $\tau=1$ and $\eta\leq 3/(10L\tau)$ for $\tau\geq 2$, where $L>0$ is the Lipschitz constant of the gradient of the cost function. Furthermore, we show linear convergence of the cost function under the Polyak-Łojasiewicz (PL) condition, for which the admissible range of the learning rate is further improved to $\eta\leq 9/(10L\tau)$ for large delay $\tau$. The proof framework for this result is also extended to stochastic gradient descent with time-varying delay under the PL condition. Finally, numerical experiments are provided to confirm the analyzed results.
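To make the setting concrete, the following is a minimal sketch of gradient descent with a fixed delay, assuming the usual update $x_{t+1} = x_t - \eta\nabla f(x_{t-\tau})$ with the first $\tau+1$ iterates held at $x_0$ (the initialization is an assumption; the abstract does not specify it). The diagonal quadratic test problem, the function name `delayed_gradient_descent`, and all numerical values are hypothetical choices for illustration, not the authors' experiments. The step size is taken at the upper end of the range quoted above for $\tau\geq 2$, and the error of the last iterate itself (not of an averaged iterate) is tracked, which is what "non-ergodic" refers to.

```python
from collections import deque
import math


def delayed_gradient_descent(grad, x0, eta, tau, num_steps):
    """Iterate x_{t+1} = x_t - eta * grad(x_{t - tau}).

    Assumption: x_{-tau}, ..., x_0 are all equal to x0, so a stale
    iterate is available from the first step on.
    """
    # history[0] holds x_{t - tau}; history[-1] holds the current iterate x_t.
    history = deque([list(x0)] * (tau + 1), maxlen=tau + 1)
    trajectory = [list(x0)]
    for _ in range(num_steps):
        stale_gradient = grad(history[0])
        x_next = [xi - eta * gi for xi, gi in zip(history[-1], stale_gradient)]
        history.append(x_next)  # maxlen discards the oldest iterate
        trajectory.append(x_next)
    return trajectory


# Hypothetical test problem: f(x) = 0.5 * sum(lam_i * x_i**2), minimizer x* = 0.
# It is mu-strongly convex and L-smooth with mu = min(lam), L = max(lam).
eigenvalues = [1.0, 4.0, 10.0]
mu, L = min(eigenvalues), max(eigenvalues)


def grad_f(x):
    return [lam * xi for lam, xi in zip(eigenvalues, x)]


tau = 5
eta = 3.0 / (10.0 * L * tau)  # upper end of the range quoted for tau >= 2

trajectory = delayed_gradient_descent(grad_f, [1.0, 1.0, 1.0], eta, tau, 2000)
# Distance of each iterate to the minimizer x* = 0.
errors = [math.sqrt(sum(xi * xi for xi in x)) for x in trajectory]

for t in (0, 500, 1000, 2000):
    print(f"t={t:5d}  ||x_t - x*|| = {errors[t]:.3e}")
```

Plotting `errors` on a logarithmic scale should give a roughly straight line, which is the qualitative behaviour a linear convergence rate predicts.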

Paper Structure (10 sections, 10 theorems, 128 equations, 4 figures)


Key Result

Theorem 1.1

Assume that $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is $\mu$-strongly convex and $L$-smooth, and suppose that $f$ is given as $f(x) = 2^{-1}\,x^T Ax +b^T x +c$ for some $A\in\mathbb{R}^{d\times d}$, $b\in\mathbb{R}^d$ and $c\in\mathbb{R}$. Then, for a positive stepsize $\eta\leq \frac{1}{20L(\tau+1)}$

Figures (4)

  • Figure 1: Graphs of the log-scaled error $\mathcal{E}_t$ computed by gradient descent with various delays $\tau=5$, $10$, $20$, $100$, for the least-squares regression problem.
  • Figure 2: Graphs of the log-scaled error $\mathcal{E}_t$ computed by gradient descent with various delays $\tau=5$, $10$, $20$, $100$, for the logistic classification problem.
  • Figure 3: Graphs of the log-scaled cost error $e_t$ for the cost function $f(x)=2^{-1}\|Ax-b\|^2$, computed by the gradient descent with the delay $\tau=25$.
  • Figure 4: Graphs of the log-scaled cost error $e_t$ computed by the mini-batch stochastic gradient descent with various delay bounds $\boldsymbol{\tau}=10$, $50$.
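
The setting of Figure 3 can be illustrated with a small sketch. A cost of the form $f(x)=2^{-1}\|Ax-b\|^2$ with a rank-deficient $A^TA$ is not strongly convex but satisfies the PL inequality $2^{-1}\|\nabla f(x)\|^2 \geq \mu\,(f(x)-f^*)$, where $\mu$ is the smallest nonzero eigenvalue of $A^TA$. The $2\times 3$ matrix below, the starting point, and the iteration count are hypothetical values chosen so that $L$ and $\mu$ can be read off by hand; only the delay $\tau=25$ and the step-size bound $9/(10L\tau)$ come from the text, and the update rule and initialization are the same assumptions as for plain delayed gradient descent.

```python
# Hypothetical 2x3 problem: the rows of A are orthogonal with squared norms
# 2 and 6, so A^T A has eigenvalues {0, 2, 6}. Hence f is L-smooth with L = 6,
# is not strongly convex, and satisfies the PL inequality with mu = 2.
A = [[1.0, 1.0, 0.0],
     [1.0, -1.0, 2.0]]
b = [1.0, 2.0]
L = 6.0
mu = 2.0


def residual(x):
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]


def cost(x):
    # f(x) = 0.5 * ||Ax - b||^2; here f* = 0 because A has full row rank.
    return 0.5 * sum(r * r for r in residual(x))


def grad(x):
    # grad f(x) = A^T (Ax - b)
    r = residual(x)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


tau = 25
eta = 9.0 / (10.0 * L * tau)  # step-size bound quoted for the PL case

# Assumption: x_{-tau}, ..., x_0 all equal the starting point.
iterates = [[0.0, 0.0, 0.0]] * (tau + 1)
cost_errors = [cost(iterates[-1])]  # e_t = f(x_t) - f*, with f* = 0
pl_holds = True

for _ in range(1500):
    x_t, x_stale = iterates[-1], iterates[-1 - tau]
    stale_gradient = grad(x_stale)
    x_next = [xi - eta * gi for xi, gi in zip(x_t, stale_gradient)]
    iterates.append(x_next)
    cost_errors.append(cost(x_next))

    # Check the PL inequality at the new iterate (small relative tolerance
    # to absorb floating-point rounding).
    g = grad(x_next)
    lhs = 0.5 * sum(gi * gi for gi in g)
    pl_holds = pl_holds and lhs >= mu * cost_errors[-1] * (1.0 - 1e-9)

for t in (0, 500, 1000, 1500):
    print(f"t={t:5d}  e_t = {cost_errors[t]:.3e}")
print("PL inequality held along the trajectory:", pl_holds)
```

Because stale gradients are used, the cost error need not decrease at every single step; the sketch only checks that it has decayed substantially by the final iterate.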

Theorems & Definitions (19)

  • Theorem 1.1: ASS
  • Theorem 1.2: SK
  • Theorem 1.3
  • Remark 1.4
  • Theorem 1.5
  • Theorem 1.6
  • Proposition 2.1
  • proof
  • Lemma 2.2
  • proof
  • ...and 9 more