Directional Smoothness and Gradient Methods: Convergence and Adaptivity
Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower
TL;DR
This work replaces global $L$-smoothness with directional smoothness $M$ to derive path-dependent sub-optimality bounds for gradient methods. It develops a hierarchy of directional smoothness functions (point-wise $D$, path-wise $A$, and optimal $H$) and shows that adapting the step-sizes to $M$ tightens the per-iteration progress guarantee, with extensions to acceleration. For convex quadratics, the strongly adapted step-sizes reduce to Rayleigh quotients and recover classical choices such as the Cauchy step-size. For general convex functions, exponential search and Polyak's step-size achieve fast path-dependent rates without explicit knowledge of $M$, and normalized GD also attains fast rates under mild conditions. Experiments on logistic regression show that the resulting guarantees are tighter than those of the classical $L$-smoothness analysis.
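To make the quadratic case concrete, here is a minimal sketch (not the paper's code) of gradient descent with the Cauchy step-size on $f(x) = \frac{1}{2} x^\top A x$. The step $\eta_k = g_k^\top g_k / (g_k^\top A g_k)$ is the reciprocal of the Rayleigh quotient of $A$ at the gradient $g_k$, so it is never smaller than $1/L$. The matrix `A`, the starting point, and the helper names `matvec`, `dot`, and `cauchy_gd` are illustrative assumptions.

```python
# Sketch: GD with the Cauchy (exact line-search) step-size on a convex
# quadratic f(x) = 0.5 * x^T A x. A and the starting point are assumed
# values chosen for illustration, not taken from the paper.

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def cauchy_gd(A, x, steps):
    """Run GD with eta_k = (g^T g) / (g^T A g).

    Returns the objective value at each iterate and the step-sizes used.
    """
    values, etas = [], []
    for _ in range(steps):
        g = matvec(A, x)                  # gradient of 0.5 * x^T A x
        values.append(0.5 * dot(x, g))    # f(x) = 0.5 * x^T (A x)
        gg = dot(g, g)
        if gg == 0.0:                     # already at the minimizer
            break
        eta = gg / dot(g, matvec(A, g))   # inverse Rayleigh quotient of A at g
        etas.append(eta)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return values, etas

A = [[1.0, 0.0], [0.0, 10.0]]             # eigenvalues 1 and 10, so L = 10
values, etas = cauchy_gd(A, [10.0, 1.0], steps=20)
```

Because the Rayleigh quotient of `A` lies between its smallest and largest eigenvalues, every entry of `etas` falls in $[1/10, 1]$ for this choice of `A`, and the objective values in `values` decrease at every step.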
Abstract
We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.
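As an illustration of a step-size that uses no smoothness constant, the sketch below runs GD with the classical Polyak step-size $\eta_k = (f(x_k) - f^*) / \|\nabla f(x_k)\|^2$ on a smooth, convex, non-quadratic test function whose minimum value $f^* = 0$ is known. The test function $f(x) = \sum_i \log\cosh(a_i x_i)$, the coefficients `a`, the starting point, and the name `polyak_gd` are all assumptions made for this example; the exact variant analyzed in the paper may differ, for instance in a scaling factor.

```python
import math

# Sketch: GD with the classical Polyak step-size on the assumed test function
# f(x) = sum_i log(cosh(a_i * x_i)), which is convex with minimizer x* = 0
# and minimum value f* = 0.

def f(a, x):
    return sum(math.log(math.cosh(ai * xi)) for ai, xi in zip(a, x))

def grad(a, x):
    return [ai * math.tanh(ai * xi) for ai, xi in zip(a, x)]

def polyak_gd(a, x, steps, f_star=0.0):
    """Return objective values and distances to x* = 0 along the GD path."""
    f_vals, dists = [], []
    for _ in range(steps):
        fx, g = f(a, x), grad(a, x)
        f_vals.append(fx)
        dists.append(math.sqrt(sum(xi * xi for xi in x)))
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:                   # stationary point reached
            break
        eta = (fx - f_star) / g_sq        # Polyak step: needs f*, not L or M
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return f_vals, dists

f_vals, dists = polyak_gd(a=[1.0, 3.0], x=[2.0, -1.0], steps=30)
```

For convex objectives the Polyak step never increases the distance to the minimizer, so `dists` is non-increasing, whereas the objective values in `f_vals` need not decrease monotonically.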
