Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods

Daniil Vankov, Anton Rodomanov, Angelia Nedich, Lalitha Sankar, Sebastian U. Stich

TL;DR

This work analyzes gradient-based optimization for $(L_0,L_1)$-smooth functions, a broad generalization of Lipschitz-smooth models relevant to modern learning. By deriving a tighter first-order characterization and principled gradient steps, the authors connect the standard gradient method (GM), normalized GM, and Polyak-step GM to sharper complexity bounds. In the convex setting, they significantly improve guarantees to $\mathcal{O}\left(\frac{L_0 R^2}{\epsilon} + L_1 R \ln \frac{F_0}{\epsilon}\right)$, with adaptive methods that do not require explicit knowledge of $(L_0,L_1)$, and they introduce AGMsDR, an accelerated method achieving the rate $\mathcal{O}\left(\sqrt{\frac{L_0 R^2}{\epsilon}} + \lceil(L_1 R)^{2/3}\rceil \lceil\ln \frac{F_0}{\epsilon}\rceil\right)$. The results unify nonconvex and convex analyses, provide improved worst-case bounds for several variants, and are supported by numerical experiments. Overall, the paper advances theory and practice for optimization under $(L_0,L_1)$-smoothness, offering adaptive and accelerated methods with guarantees across problem regimes.

Abstract

We study gradient methods for optimizing $(L_0, L_1)$-smooth functions, a class that generalizes Lipschitz-smooth functions and has gained attention for its relevance in machine learning. We provide new insights into the structure of this function class and develop a principled framework for analyzing optimization methods in this setting. While our convergence rate estimates recover existing results for minimizing the gradient norm in nonconvex problems, our approach significantly improves the best-known complexity bounds for convex objectives. Moreover, we show that the gradient method with Polyak stepsizes and the normalized gradient method achieve nearly the same complexity guarantees as methods that rely on explicit knowledge of $(L_0, L_1)$. Finally, we demonstrate that a carefully designed accelerated gradient method can be applied to $(L_0, L_1)$-smooth functions, further improving all previous results.
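
For orientation, the class is commonly defined in the literature as follows for twice-differentiable functions. This is the standard formulation, recalled here as an assumption: the paper's own Definition 2.1 is not reproduced on this page and may be stated more generally.

$$
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x \in \mathbb{R}^d .
$$

Setting $L_1 = 0$ recovers the usual $L_0$-Lipschitz-smooth case, which is the sense in which the class generalizes Lipschitz-smooth functions.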

Paper Structure

This paper contains 34 sections, 21 theorems, 168 equations, 3 figures, 1 algorithm.

Key Result

Proposition 2.4

Let $f \colon \mathbb{R}^d \rightarrow \mathbb{R}$ be a twice continuously differentiable $(L_0, L_1)$-smooth function. Then, the following statements hold:

Figures (3)

  • Figure 7.1: Comparison of gradient methods for $f(x) = \frac{1}{p} \|x\|^p$. $\frac{\hat{R}}{R}$-NGD stands for the Normalized Gradient Method, where $\hat{R}$ is an estimate of the true initial distance to a solution $R$. $\eta_*$-GD, $\eta^{\mathrm{si}}$-GD, and $\eta^{\mathrm{cl}}$-GD stand for the gradient method with stepsizes \ref{eq-grad-step-1}, \ref{eq-grad-step-2}, and \ref{eq-clipping}, respectively; PS-GD stands for the gradient method with Polyak stepsizes, and AGMsDR stands for Algorithm \ref{algo:AGMsDR}.
  • Figure 7.2: Convergence of the gradient method on the same function but with different choices of $(L_0, L_1)$.
  • Figure 7.3: Comparison of Algorithm \ref{algo:AGMsDR} (denoted AGMsDR) with the Similar Triangles Method (STM) and Similar Triangles Method Max (STM-max) for $f(x) = \frac{1}{p}\|x\|^p$, for different values of $p$.
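
The test function in these figures, $f(x) = \frac{1}{p}\|x\|^p$, is simple enough to experiment with directly. The following is a minimal sketch, not the authors' experimental code, of two of the compared method families on that function: gradient descent with the classical Polyak stepsize $\eta_k = (f(x_k) - f^*) / \|\nabla f(x_k)\|^2$ (assuming $f^* = 0$ is known) and a normalized gradient method. The function names, the starting point, and the constant normalized step `eta` are hypothetical choices for illustration; the paper's specific stepsize rules are not reproduced on this page.

```python
import math


def f(x, p):
    """Test function f(x) = (1/p) * ||x||^p, minimized at x = 0 with f* = 0."""
    return math.sqrt(sum(v * v for v in x)) ** p / p


def grad_f(x, p):
    """Gradient of f: ||x||^(p-2) * x."""
    r = math.sqrt(sum(v * v for v in x))
    return [r ** (p - 2) * v for v in x]


def polyak_gd(x0, p, num_iters, f_star=0.0):
    """Gradient descent with the classical Polyak stepsize (f* assumed known)."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x, p)
        g_sq = sum(v * v for v in g)
        if g_sq == 0.0:  # stationary point reached (or underflow)
            break
        eta = (f(x, p) - f_star) / g_sq
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


def normalized_gd(x0, p, num_iters, eta):
    """Normalized gradient method with a constant step eta (illustrative choice).

    Returns the best iterate seen, since the last iterate may oscillate
    around the minimizer at a distance of up to eta.
    """
    x = list(x0)
    best = list(x)
    for _ in range(num_iters):
        g = grad_f(x, p)
        g_norm = math.sqrt(sum(v * v for v in g))
        if g_norm == 0.0:
            break
        x = [xi - eta * gi / g_norm for xi, gi in zip(x, g)]
        if f(x, p) < f(best, p):
            best = list(x)
    return best


if __name__ == "__main__":
    x0, p = [3.0, 4.0], 4
    print("Polyak GD:     f =", f(polyak_gd(x0, p, 50), p))
    print("Normalized GD: f =", f(normalized_gd(x0, p, 100, eta=0.1), p))
```

On this particular function the Polyak step simplifies to $x_{k+1} = (1 - 1/p)\,x_k$, so the iterates contract linearly without any knowledge of $(L_0, L_1)$, which illustrates why such adaptive stepsizes are attractive in this setting.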

Theorems & Definitions (45)

  • Definition 2.1
  • Example 2.2
  • Example 2.3
  • Proposition 2.4
  • Lemma 2.5
  • Lemma 2.6
  • Lemma 2.7
  • Corollary 2.8
  • Theorem 3.1
  • Theorem 3.2
  • ...and 35 more