
Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower

TL;DR

This work replaces global $L$-smoothness with directional smoothness $M$ to derive path-dependent sub-optimality bounds for gradient methods. It develops a hierarchy of directional smoothness notions (point-wise $D$, path-wise $A$, and optimal $H$) and shows how adapting step-sizes to $M$ tightens progress, with extensions to acceleration. In the quadratic case, strongly adapted steps align with Rayleigh quotients and recover classical step-sizes like the Cauchy step-size; for general convex functions, exponential search and Polyak's step-size provide fast path-dependent rates without requiring explicit knowledge of $M$. Normalized GD also attains fast rates under mild conditions, and extensive logistic-regression experiments demonstrate substantially tighter convergence behavior than traditional $L$-smooth analyses. Overall, the paper offers a practical, path-aware framework for adaptive gradient methods with strong theoretical and empirical support.
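For the quadratic case, the link between the Cauchy step-size and Rayleigh quotients can be sketched in a few lines. The example below assumes $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with a symmetric positive-definite $A$; the matrix, starting point, and the function name `cauchy_step_gd` are illustrative choices, not taken from the paper. The Cauchy step is the exact line-search step along the negative gradient, which equals the inverse Rayleigh quotient of $A$ at the gradient.

```python
import numpy as np

def cauchy_step_gd(A, b, x0, num_steps=50):
    """GD on f(x) = 0.5 x^T A x - b^T x using the Cauchy (exact line-search) step."""
    x = np.asarray(x0, dtype=float)
    values = []
    for _ in range(num_steps):
        g = A @ x - b                       # gradient of the quadratic
        values.append(0.5 * x @ A @ x - b @ x)
        gg = g @ g
        if gg == 0.0:                       # already at the minimizer
            break
        rayleigh = (g @ A @ g) / gg         # Rayleigh quotient of A at g
        x = x - g / rayleigh                # eta_k = 1 / rayleigh
    return x, np.array(values)

# Illustrative problem (assumed, not from the paper): diagonal Hessian, minimizer at 0.
A = np.diag([100.0, 10.0, 1.0])
b = np.zeros(3)
x_final, values = cauchy_step_gd(A, b, x0=np.ones(3))
```

Because each step minimizes $f$ exactly along the gradient direction, the recorded objective values never increase, and the step-size always lies between the inverse of the largest and the inverse of the smallest eigenvalue of $A$.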

Abstract

We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.
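To make the Polyak step-size concrete, here is a minimal sketch that uses the classical form $\eta_k = (f(x_k) - f(x^*)) / \|\nabla f(x_k)\|_2^2$ on a smooth convex test function. The objective $f(x) = \sum_i \log\cosh((Ax)_i)$, the matrix `A`, and the helper names are assumptions chosen because the minimum value $f(x^*) = 0$ at $x^* = 0$ is known; the paper's experiments use logistic regression, and its exact step-size constants may differ.

```python
import numpy as np

def f(x, A):
    """Smooth convex test objective: sum of log-cosh of A x (minimum 0 at x = 0)."""
    z = A @ x
    return np.sum(np.logaddexp(z, -z) - np.log(2.0))

def grad(x, A):
    """Gradient of f: A^T tanh(A x)."""
    return A.T @ np.tanh(A @ x)

def polyak_gd(A, x0, f_star=0.0, num_steps=20):
    """GD with the classical Polyak step-size; requires the optimal value f_star."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        g = grad(x, A)
        gap = f(x, A) - f_star              # current sub-optimality
        gg = g @ g
        if gap <= 0.0 or gg == 0.0:         # numerically at the minimizer
            break
        x = x - (gap / gg) * g              # eta_k = gap / ||g||^2
        iterates.append(x.copy())
    return np.array(iterates)

# Illustrative data (assumed): a small full-column-rank matrix and starting point.
A = np.array([[3.0, 0.0], [0.0, 0.5], [1.0, 1.0]])
iterates = polyak_gd(A, x0=np.array([2.0, -1.5]))
dist_to_opt = np.linalg.norm(iterates, axis=1)   # x* = 0 for this objective
```

The step-size uses only the optimality gap and the gradient norm, so no smoothness constant is needed. For convex $f$, this choice guarantees that the distance to the minimizer does not increase from one iterate to the next, which `dist_to_opt` lets you check directly.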

Paper Structure (20 sections, 35 theorems, 175 equations, 4 figures, 1 algorithm)


Key Result

Lemma 2.1

If $f$ is convex and differentiable, then the point-wise directional smoothness satisfies,
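The inequality itself does not appear above. As a hedged sketch of a bound of this type, assume the point-wise directional smoothness is defined as $D(x, y) = 2 \|\nabla f(y) - \nabla f(x)\|_2 / \|y - x\|_2$ (this definition is an assumption, since it is not stated in this excerpt). Convexity applied at $y$ and the Cauchy-Schwarz inequality then give, for all $x \neq y$,

$$
\begin{aligned}
f(y) &\le f(x) + \langle \nabla f(y),\, y - x \rangle \\
     &= f(x) + \langle \nabla f(x),\, y - x \rangle + \langle \nabla f(y) - \nabla f(x),\, y - x \rangle \\
     &\le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{D(x, y)}{2} \|y - x\|_2^2 .
\end{aligned}
$$

The first line is the convexity inequality $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle$ rearranged, and the last line uses $\langle \nabla f(y) - \nabla f(x), y - x \rangle \le \|\nabla f(y) - \nabla f(x)\|_2 \|y - x\|_2$. The result has the same form as the classical $L$-smoothness upper bound, with $L$ replaced by a quantity that depends only on the pair $(x, y)$.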

Figures (4)

  • Figure 1: Comparison of actual (solid lines) and theoretical (dashed lines) convergence rates for GD with (i) step-sizes strongly adapted to the directional smoothness ($\eta_k = 1 / M(x_{k+1}, x_k)$) and (ii) the Polyak step-size. Both problems are logistic regressions on datasets from the UCI repository (Asuncion and Newman, 2007). Our bounds using directional smoothness are tighter than those based on global $L$-smoothness of $f$ and adapt to the optimization path. For example, on mammographic, our theoretical rate for the Polyak step-size concentrates rapidly exactly when the optimizer shows fast convergence.
  • Figure 2: Illustration of GD with $\eta_k = 1 / L$. Even though this step-size exactly minimizes the upper-bound from $L$-smoothness, the directional smoothness $M_k$ better predicts the progress of the gradient step because $M_k \ll L$. Our rates improve on $L$-smoothness because of this tighter bound.
  • Figure 3: Performance of GD with different step-size rules for a synthetic quadratic problem. We run GD for 20,000 steps on 20 random quadratic problems with $L=1000$ and Hessian skew. Left-to-right, the first plot shows the optimality gap $f(x_k) - f(x^*)$, the second shows the point-wise directional smoothness $D( x_k, x_{k+1})$, and the third shows step-sizes used by the different methods.
  • Figure 4: Comparison of GD with $\eta_k = 1 / L$, step-sizes strongly adapted to the point-wise smoothness ($\eta_k = 1/D( x_k, x_{k+1})$), and the Polyak step-size against normalized GD (Norm. GD) and the AdGD method on three logistic regression problems. AdGD uses a smoothed version of the point-wise directional smoothness from the previous iteration to set $\eta_k$. We find that GD methods with adaptive step-sizes consistently outperform GD with $\eta_k = 1 / L$ and even obtain a linear rate on horse-colic.
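The gap between directional and global smoothness described in these captions can be illustrated with a short sketch. It assumes the point-wise directional smoothness is $D(x, y) = 2 \|\nabla f(y) - \nabla f(x)\|_2 / \|y - x\|_2$ (a definition not stated in this excerpt) and tracks it along the path of GD with $\eta_k = 1/L$ on a small diagonal quadratic. The eigenvalues, starting point, and helper names are illustrative and do not reproduce the paper's experiments.

```python
import numpy as np

def pointwise_smoothness(grad, x, y):
    """Assumed definition: D(x, y) = 2 * ||grad(y) - grad(x)|| / ||y - x||."""
    return 2.0 * np.linalg.norm(grad(y) - grad(x)) / np.linalg.norm(y - x)

eigs = np.array([100.0, 10.0, 1.0])     # Hessian eigenvalues, so L = 100
L = eigs.max()

def grad(x):
    """Gradient of f(x) = 0.5 * sum(eigs * x**2)."""
    return eigs * x

x = np.ones(3)
D_path = []
for _ in range(200):
    x_next = x - grad(x) / L            # GD step with eta_k = 1 / L
    D_path.append(pointwise_smoothness(grad, x, x_next))
    x = x_next
D_path = np.array(D_path)

print(f"L = {L:.0f}, first D = {D_path[0]:.1f}, last D = {D_path[-1]:.2f}")
```

Under this definition $D$ can be as large as $2L$ on early steps, when the update is dominated by the high-curvature direction. Once that component has been eliminated, the iterates move along low-curvature directions and $D$ falls far below $L$, which is the regime where path-dependent bounds are tighter than the worst-case analysis.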

Theorems & Definitions (60)

  • Definition 2.1
  • Lemma 2.1
  • Proposition 2.1
  • Lemma 2.1
  • Proposition 3.0
  • Proposition 3.0
  • Proposition 3.0
  • Theorem 3.1
  • Proposition 4.0
  • Proposition 4.0
  • ...and 50 more