An adaptive framework for first-order gradient methods

Xiaozhe Hu; Sara Pollock; Zhongqin Xue; Yunrong Zhu

TL;DR

The paper addresses how to choose parameters for first-order gradient methods when the strong convexity parameter $\mu$ is unknown. It introduces a unified adaptive framework that takes the geometric mean of the ratios of successive residual norms as an empirical bound $\rho^*$ on the convergence rate. This bound drives adaptive updates of the step size $\alpha$ and momentum $\beta$ for GD, NAG, and HB, with $L$ normalized to $1$ so that the focus stays on exploiting curvature. The authors prove that, for quadratic problems, the adaptive schemes converge at least as fast as gradient descent with $\alpha=1/L$. Experiments on quadratic problems, regularized logistic regression, and Huber-TV image denoising show performance comparable to accelerated methods run with optimal parameters, and in some cases better because the adaptive schemes pick up local curvature. The methods are simple to implement and need only a few extra computations on quantities the iteration already produces.
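For orientation, the block below writes out the textbook forms of the three iterations named above, together with the classical parameter choices for an $L$-smooth, $\mu$-strongly convex objective $f$. These are the standard formulas from the optimization literature, not the paper's adaptive rules (which this summary does not reproduce). The notation $y_k$ for the extrapolated point is an assumption made here. The point of the comparison is that every classical choice requires $\mu$.

```latex
% Standard (non-adaptive) iterations; not the paper's adaptive updates.
\begin{align*}
\text{GD:}  \quad & x_{k+1} = x_k - \alpha \nabla f(x_k),
                  && \alpha = \frac{2}{L+\mu}, \\
\text{HB:}  \quad & x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta\,(x_k - x_{k-1}),
                  && \alpha = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2},\quad
                     \beta = \left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^{2}, \\
\text{NAG:} \quad & y_k = x_k + \beta\,(x_k - x_{k-1}),\quad
                    x_{k+1} = y_k - \alpha \nabla f(y_k),
                  && \alpha = \frac{1}{L},\quad
                     \beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}.
\end{align*}
```

The heavy-ball parameters shown are the classical optimal choice for quadratic objectives.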

Abstract

Gradient methods are widely used in optimization problems. In practice, while the smoothness parameter can be estimated utilizing techniques such as backtracking, estimating the strong convexity parameter remains a challenge; moreover, even with the optimal parameter choice, convergence can be slow. In this work, we propose a framework for dynamically adapting the step size and momentum parameters in first-order gradient methods for the optimization problem, without prior knowledge of the strong convexity parameter. The main idea is to use the geometric average of the ratios of successive residual norms as an empirical estimate of the upper bound on the convergence rate, which in turn allows us to adaptively update the algorithm parameters. The resulting algorithms are simple to implement, yet efficient in practice, requiring only a few additional computations on existing information. The proposed adaptive gradient methods are shown to converge at least as fast as gradient descent for quadratic optimization problems. Numerical experiments on both quadratic and nonlinear problems validate the effectiveness of the proposed adaptive algorithms. The results show that the adaptive algorithms are comparable to their counterparts using optimal parameters, and in some cases, they capture local information and exhibit improved performance.
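The rate estimate described in the abstract can be illustrated with a minimal Python sketch, assuming a quadratic objective $f(x) = \tfrac12 x^\top A x - b^\top x$ whose residual is $r_k = b - A x_k$. The function name `gd_with_rate_estimate`, the test matrix, and the final step that turns $\rho^*$ into a guess for $\mu$ via $\rho^* \approx 1 - \alpha\mu$ are assumptions made for illustration; the paper's actual update rules for $\alpha$ and $\beta$ are not reproduced in this summary.

```python
import numpy as np

def gd_with_rate_estimate(A, b, x0, alpha, num_iters):
    """Fixed-step gradient descent on f(x) = 0.5 x^T A x - b^T x.

    Alongside the iterate, track the geometric mean of the ratios
    ||r_{j+1}|| / ||r_j|| of successive residual norms, r_j = b - A x_j.
    Returns the final iterate and the history of geometric means.
    """
    x = x0.copy()
    r = b - A @ x
    log_ratio_sum = 0.0  # running sum of log-ratios (numerically safer than a product)
    rho_history = []
    for k in range(1, num_iters + 1):
        x = x + alpha * r            # gradient step, since grad f(x) = A x - b = -r
        r_new = b - A @ x
        log_ratio_sum += np.log(np.linalg.norm(r_new) / np.linalg.norm(r))
        rho_history.append(np.exp(log_ratio_sum / k))  # geometric mean of k ratios
        r = r_new
    return x, rho_history

# Hypothetical test problem: diagonal A with eigenvalues in [0.01, 1], so L = 1.
rng = np.random.default_rng(0)
A = np.diag(np.linspace(0.01, 1.0, 50))
b = rng.standard_normal(50)
alpha = 1.0  # alpha = 1/L with L normalized to 1

x, rho_hist = gd_with_rate_estimate(A, b, np.zeros(50), alpha, num_iters=200)
rho_star = rho_hist[-1]

# Illustrative assumption only: read rho* as 1 - alpha * mu and solve for mu.
mu_guess = (1.0 - rho_star) / alpha
print(f"rho* = {rho_star:.4f}, implied mu guess = {mu_guess:.4f}")
```

Because the ratios telescope, the geometric mean after $k$ steps equals $(\|r_k\| / \|r_0\|)^{1/k}$, so the estimate costs one extra norm per iteration. In this sketch each ratio is bounded by the spectral radius of $I - \alpha A$, so the implied guess can only overestimate $\mu$; how the paper guards against that is not covered here.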

Paper Structure (10 sections, 5 theorems, 74 equations, 20 figures, 1 table, 3 algorithms)

Introduction
Adaptive Algorithms
Adaptive Gradient Descent
Adaptive Accelerated Gradient Descent
Adaptive Nesterov Acceleration
Numerical Experiments
Quadratic optimization problem
Logistic regression problem with regularization
Huber-TV regularized image denoising
Conclusion

Key Result

Theorem 2.1

If $L < 1$, let $\rho_k = \rho(I - \alpha_k A)$ in alg:AGD. Then $\rho_k$ satisfies

Figures (20)

Figure 1: Error (left) and estimated $\rho^*$ (right) for GD on a diagonal matrix with random eigenvalue distribution satisfying $L>\frac{2}{2-\mu}-\mu$.
Figure 2: Error (left) and estimated $\rho^*$ (right) for GD on a diagonal matrix with random eigenvalue distribution satisfying $L\leq\frac{2}{2-\mu}-\mu$.
Figure 3: Error (left) and estimated $\rho^*$ (right) for GD on a diagonal matrix with uniform eigenvalue distribution ($n=1000$).
Figure 4: Error (left) and estimated $\rho^*$ (right) for GD on a diagonal matrix with log-spaced eigenvalue distribution ($n=1000$).
Figure 5: Error (left) and estimated $\rho^*$ (right) for GD on a diagonal matrix with clustered eigenvalue distribution ($n=1000$).
...and 15 more figures

Theorems & Definitions (15)

Theorem 2.1
Proof 1
Remark 2.2
Remark 2.3
Lemma 2.4
Proof 2
Theorem 2.5
Proof 3
Remark 2.6
Lemma 2.7
...and 5 more
