Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

Aleksandr Lobanov, Alexander Gasnikov, Eduard Gorbunov, Martin Takáč

TL;DR

This work analyzes unconstrained convex optimization under generalized $(L_0,L_1)$-smoothness, where the Hessian bound grows linearly with the gradient norm. It establishes that Gradient Descent and two variants (Normalized GD and Clipped GD) exhibit a linear convergence phase whenever the gradient norm satisfies $\|\nabla f(x)\| \ge L_0/L_1$, transitioning to the standard sublinear rate once this condition ceases to hold. The authors extend the analysis to random coordinate descent (RCD) and OrderRCD, proving similar linear-convergence behavior under $(L_0,L_1)$-coordinate-smoothness and deriving explicit rate bounds, including the regime $L_0 = 0$, which yields linear convergence. An extension to strongly convex objectives provides compound decay bounds that improve upon prior results and do not require classical $L$-smoothness. Overall, the paper broadens the class of problems where fast linear convergence can be guaranteed and offers practical guidance for step-size selection and algorithm choice in ML optimization.
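For orientation, the condition described above is commonly written as follows for twice-differentiable $f$. This is the usual form of generalized smoothness in the literature; whether the paper states its assumption in exactly this form or in a first-order equivalent is not shown in this summary, so treat it as an assumption.

$$
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x .
$$

Setting $L_1 = 0$ recovers classical $L$-smoothness with $L = L_0$. When $\|\nabla f(x)\| \ge L_0/L_1$, we have $L_0 \le L_1 \|\nabla f(x)\|$, so the right-hand side is at most $2 L_1 \|\nabla f(x)\|$: in that regime the curvature bound is driven by the gradient-norm term, and this is the regime in which the summary reports linear convergence.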

Abstract

The gradient descent (GD) method is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history traced back to a paper from 1847 (Cauchy, 1847). It has been studied under various assumptions, including so-called $(L_0,L_1)$-smoothness, which has recently received considerable attention in the ML community. In this paper, we provide a refined convergence analysis of gradient descent and its variants, assuming generalized smoothness. In particular, we show that $(L_0,L_1)$-GD has the following behavior in the convex setup: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$, the algorithm has linear convergence in function suboptimality, and when $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$, $(L_0,L_1)$-GD has the standard sublinear rate. Moreover, we show that this behavior is shared by its variants with different types of oracle: Normalized Gradient Descent and Clipped Gradient Descent (when the full gradient $\nabla f(x)$ is available); Random Coordinate Descent (when the gradient component $\nabla_{i} f(x)$ is available); and Random Coordinate Descent with Order Oracle (when only $\text{sign} [f(y) - f(x)]$ is available). In addition, we extend our analysis of $(L_0,L_1)$-GD to the strongly convex case.

Paper Structure

This paper contains 31 sections, 10 theorems, 104 equations, 1 table, 6 algorithms.

Key Result

Theorem 3.1

Let the function $f$ satisfy the $(L_0,L_1)$-smoothness assumption and the convexity assumption (strong convexity with ${\mu=0}$). Then GD with step size ${\eta_k = (L_0 + L_1 \left\| \nabla f(x^k) \right\| )^{-1}}$ guarantees […]. In the general case, the convergence rate is […], where $T\geq 0$ is the smallest index such that $\|\nabla f(x^T)\| < \frac{L_0}{L_1}$.
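The step-size rule in Theorem 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gd_l0_l1`, the test objective $f(x) = \sum_i \cosh(x_i)$, the starting point, and the constants $L_0 = L_1 = 1$ are all chosen here for illustration. For this objective the Hessian is $\mathrm{diag}(\cosh x_i)$ and $\cosh t \le 1 + |\sinh t|$, so it satisfies the Hessian-based form of $(L_0,L_1)$-smoothness with those constants.

```python
import numpy as np


def gd_l0_l1(grad, x0, L0, L1, num_iters):
    """Gradient descent with step size eta_k = 1 / (L0 + L1 * ||grad f(x^k)||).

    Hypothetical helper written for this illustration.
    """
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]
    for _ in range(num_iters):
        g = grad(x)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            # Already stationary; also avoids dividing by zero when L0 == 0.
            break
        eta = 1.0 / (L0 + L1 * g_norm)
        x = x - eta * g
        history.append(x.copy())
    return x, history


# Illustrative convex objective (an assumption, not taken from the paper).
def f(x):
    return np.sum(np.cosh(x))


def grad_f(x):
    return np.sinh(x)


L0, L1 = 1.0, 1.0
x_final, history = gd_l0_l1(grad_f, [5.0, -3.0], L0, L1, num_iters=200)

# Suboptimality f(x^k) - f*, where f* = 2 is attained at x = 0 in two dimensions.
gaps = [f(x) - 2.0 for x in history]

# Smallest index T with ||grad f(x^T)|| < L0 / L1, as in the theorem statement.
T = next(k for k, x in enumerate(history)
         if np.linalg.norm(grad_f(x)) < L0 / L1)

print("threshold index T:", T)
print("final suboptimality:", gaps[-1])
```

Printing `gaps[:T]` shows the iterates from the large-gradient phase, where $\|\nabla f(x^k)\| \ge L_0/L_1$; the step length $\eta_k \|\nabla f(x^k)\|$ stays below $1/L_1$ throughout, which is what keeps the method stable even though the gradient is large far from the minimizer.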

Theorems & Definitions (19)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 3.3
  • Remark 3.4
  • Theorem 3.5
  • Remark 3.6: Strong growth condition
  • Theorem 4.1
  • Remark 4.2: Strong growth condition
  • Theorem 4.3
  • Remark 4.4: Strong growth condition
  • ...and 9 more