Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Aaron Mishkin; Ahmed Khaled; Yuanhao Wang; Aaron Defazio; Robert M. Gower

TL;DR

This work replaces global $L$-smoothness with directional smoothness $M$ to derive path-dependent sub-optimality bounds for gradient methods. It develops a hierarchy of directional smoothness notions (point-wise $D$, path-wise $A$, and optimal $H$) and shows how adapting step-sizes to $M$ tightens progress, with extensions to acceleration. In the quadratic case, strongly adapted steps align with Rayleigh quotients and recover classical step-sizes like the Cauchy step-size; for general convex functions, exponential search and Polyak's step-size provide fast path-dependent rates without requiring explicit knowledge of $M$. Normalized GD also attains fast rates under mild conditions, and extensive logistic-regression experiments demonstrate substantially tighter convergence behavior than traditional $L$-smooth analyses. Overall, the paper offers a practical, path-aware framework for adaptive gradient methods with strong theoretical and empirical support.
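The quadratic case mentioned above can be made concrete with a small sketch. For $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, the Cauchy (exact line search) step-size is $\eta_k = g_k^\top g_k / g_k^\top A g_k$, the reciprocal of the Rayleigh quotient of $A$ at the gradient $g_k$. The function names and the random test problem below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cauchy_gd(A, b, x0, num_steps=50):
    """Gradient descent with exact line search on f(x) = 0.5 x^T A x - b^T x."""
    x = np.asarray(x0, dtype=float).copy()
    step_sizes = []
    for _ in range(num_steps):
        g = A @ x - b                  # gradient of the quadratic
        curvature = g @ (A @ g)        # g^T A g
        if curvature == 0.0:           # zero gradient: already at the minimizer
            break
        eta = (g @ g) / curvature      # reciprocal of the Rayleigh quotient of A at g
        step_sizes.append(eta)
        x = x - eta * g
    return x, np.array(step_sizes)

def quadratic(A, b, x):
    return 0.5 * x @ (A @ x) - b @ x

# Hypothetical test problem: a random positive-definite matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
A = Q @ Q.T + 0.1 * np.eye(5)
b = rng.standard_normal(5)
x0 = np.zeros(5)
x_final, step_sizes = cauchy_gd(A, b, x0)
eigenvalues = np.linalg.eigvalsh(A)    # sorted in ascending order
```

Because a Rayleigh quotient always lies between the smallest and largest eigenvalue of $A$, every step-size in `step_sizes` falls in $[1/\lambda_{\max}, 1/\lambda_{\min}]$ and can be much larger than the worst-case choice $1/L = 1/\lambda_{\max}$.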

Abstract

We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.
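As an illustration of a step-size rule that needs no smoothness constant, here is a minimal sketch of GD with the classical Polyak step-size $\eta_k = (f(x_k) - f^*) / \|\nabla f(x_k)\|_2^2$. It assumes the optimal value $f^*$ is known, which is arranged here by using a consistent least-squares problem with $f^* = 0$. The helper names and the test problem are assumptions made for this example, and the paper's variant may scale the step differently.

```python
import numpy as np

def polyak_gd(f, grad, x0, f_star, num_steps=100):
    """Gradient descent with the classical Polyak step-size (needs f_star)."""
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for _ in range(num_steps):
        g = grad(x)
        g_sq = g @ g
        if g_sq == 0.0:                    # stationary point reached
            break
        eta = (f(x) - f_star) / g_sq       # Polyak step-size
        x = x - eta * g
        iterates.append(x.copy())
    return np.array(iterates)

# Hypothetical consistent least-squares problem, so that f_star = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

def f(x):
    r = A @ x - b
    return 0.5 * (r @ r)

def grad(x):
    return A.T @ (A @ x - b)

iterates = polyak_gd(f, grad, np.zeros(5), f_star=0.0)
distances = np.linalg.norm(iterates - x_true, axis=1)
```

For convex $f$, this rule never increases the distance to the minimizer, so `distances` is non-increasing even though no value of $L$ or $M$ is supplied.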

Paper Structure (20 sections, 35 theorems, 175 equations, 4 figures, 1 algorithm)

Introduction
Additional Related Work
Directional Smoothness
Path-Dependent Sub-Optimality Bounds
Path-Dependent Acceleration
Adaptive Learning Rates
Adaptivity in Quadratics
Adaptivity for Convex Functions
Exponential Search
Polyak's Step-Size Rule
Normalized Gradient Descent
Experiments
Conclusion
Proofs for the Directional Smoothness section
Proofs for the Path-Dependent Sub-Optimality Bounds section
...and 5 more sections

Key Result

Lemma 2.1

If $f$ is convex and differentiable, then the point-wise directional smoothness satisfies,
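The displayed inequality is missing from this copy of the lemma. The following is a hedged reconstruction, not the source's verbatim statement. It assumes the point-wise directional smoothness is defined as $D(x, y) = 2 \|\nabla f(y) - \nabla f(x)\|_2 / \|y - x\|_2$, a definition that does not appear in the text above. Under that assumption, the bound follows from convexity and the Cauchy-Schwarz inequality:

```latex
\begin{aligned}
f(y) &\le f(x) + \langle \nabla f(y),\, y - x \rangle
      && \text{(convexity of } f\text{)} \\
     &= f(x) + \langle \nabla f(x),\, y - x \rangle
        + \langle \nabla f(y) - \nabla f(x),\, y - x \rangle \\
     &\le f(x) + \langle \nabla f(x),\, y - x \rangle
        + \|\nabla f(y) - \nabla f(x)\|_2 \, \|y - x\|_2
      && \text{(Cauchy-Schwarz)} \\
     &= f(x) + \langle \nabla f(x),\, y - x \rangle
        + \frac{D(x, y)}{2} \, \|y - x\|_2^2 .
\end{aligned}
```

The last line has the same form as the usual $L$-smoothness upper bound, with the global constant $L$ replaced by the pair-dependent quantity $D(x, y)$.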

Figures (4)

Figure 1: Comparison of actual (solid lines) and theoretical (dashed lines) convergence rates for GD with (i) step-sizes strongly adapted to the directional smoothness ($\eta_k = 1 / M(x_{k+1}, x_k)$) and (ii) the Polyak step-size. Both problems are logistic regressions on UCI repository datasets (Asuncion and Newman, 2007). Our bounds using directional smoothness are tighter than those based on global $L$-smoothness of $f$ and adapt to the optimization path. For example, on mammographic our theoretical rate for the Polyak step-size concentrates rapidly exactly when the optimizer shows fast convergence.
Figure 2: Illustration of GD with $\eta_k = 1 / L$. Even though this step-size exactly minimizes the upper-bound from $L$-smoothness, the directional smoothness $M_k$ better predicts the progress of the gradient step because $M_k \ll L$. Our rates improve on those from $L$-smoothness because of this tighter bound.
Figure 3: Performance of GD with different step-size rules for a synthetic quadratic problem. We run GD for 20,000 steps on 20 random quadratic problems with $L=1000$ and Hessian skew. Left-to-right, the first plot shows the optimality gap $f(x_k) - f(x^*)$, the second shows the point-wise directional smoothness $D( x_k, x_{k+1})$, and the third shows step-sizes used by the different methods.
Figure 4: Comparison of GD with $\eta_k = 1 / L$, step-sizes strongly adapted to the point-wise smoothness ($\eta_k = 1/D( x_k, x_{k+1})$), and the Polyak step-size against normalized GD (Norm. GD) and the AdGD method on three logistic regression problems. AdGD uses a smoothed version of the point-wise directional smoothness from the previous iteration to set $\eta_k$. We find that GD methods with adaptive step-sizes consistently outperform GD with $\eta_k = 1 / L$ and even obtain a linear rate on horse-colic.
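The quantities plotted in these figures can be sketched in a few lines. The example below runs GD with $\eta_k = 1/L$ on a logistic regression and records the point-wise directional smoothness along the path, assuming the definition $D(x, y) = 2\|\nabla f(y) - \nabla f(x)\|_2 / \|y - x\|_2$ (an assumption, since the definition is not given in the text above). The data is synthetic rather than the UCI datasets used in the paper, and all function names are hypothetical.

```python
import numpy as np

def logistic_loss(w, X, y):
    # Labels y are in {-1, +1}; logaddexp gives a stable log(1 + exp(-margin)).
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # sigmoid(-margin) written with tanh for numerical stability.
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ coeff / len(y)

def pointwise_smoothness(g_x, g_y, x, y_pt):
    # Assumed definition: D(x, y) = 2 ||grad f(y) - grad f(x)|| / ||y - x||.
    return 2.0 * np.linalg.norm(g_y - g_x) / np.linalg.norm(y_pt - x)

# Synthetic data (an assumption; the paper's experiments use UCI datasets).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n))

# Global smoothness constant of the averaged logistic loss.
L = np.linalg.eigvalsh(X.T @ X)[-1] / (4.0 * n)

w = np.zeros(d)
D_values, slacks = [], []
for _ in range(200):
    g = logistic_grad(w, X, y)
    if np.linalg.norm(g) == 0.0:
        break
    w_next = w - g / L
    D_k = pointwise_smoothness(g, logistic_grad(w_next, X, y), w, w_next)
    step = w_next - w
    # Slack in the directional upper bound: bound value minus f(w_next).
    bound = logistic_loss(w, X, y) + g @ step + 0.5 * D_k * (step @ step)
    slacks.append(bound - logistic_loss(w_next, X, y))
    D_values.append(D_k)
    w = w_next

D_values, slacks = np.array(D_values), np.array(slacks)
```

Under the assumed definition, `D_values` can never exceed $2L$ and `slacks` should be non-negative up to rounding. Comparing `D_values` with `L` shows how far the local curvature along the path sits below the global constant on a given problem.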

Theorems & Definitions (60)

Definition 2.1
Lemma 2.1
Proposition 2.1
Lemma 2.1
Proposition 3.0
Proposition 3.0
Proposition 3.0
Theorem 3.1
Proposition 4.0
Proposition 4.0
...and 50 more
