A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton Methods

Wei Lin, Qingyu Song, Hong Xu

TL;DR

The paper addresses the challenge of tuning step sizes for second-order optimization by focusing on coordinate-wise step sizes (CWSS) within the BFGS framework. It derives sufficient conditions under which CWSS remain stable and convergent, then designs an LSTM-based learn-to-optimize approach that predicts CWSS within a safe interval so that these convergence properties are preserved. The proposed BFGS-L2O method, trained with a two-loop scheme, converges up to 4x faster than scalar-step and hypergradient-based baselines on least-squares, logistic regression, log-sum-exp, and CNN training tasks, with improved stability. This work offers a scalable, data-driven mechanism for exploiting curvature information more effectively in quasi-Newton methods, potentially broadening the practical use of second-order optimization in large-scale problems.

Abstract

Tuning step sizes is crucial for the stability and efficiency of optimization algorithms. While adaptive coordinate-wise step sizes have been shown to outperform scalar step sizes in first-order methods, their use in second-order methods is still under-explored and more challenging. Current approaches, including hypergradient descent and cutting plane methods, offer limited improvements or encounter difficulties in second-order contexts. To address these limitations, we first conduct a theoretical analysis within the Broyden-Fletcher-Goldfarb-Shanno (BFGS) framework, a prominent quasi-Newton method, and derive sufficient conditions for coordinate-wise step sizes that ensure convergence and stability. Building on this theoretical foundation, we introduce a novel learn-to-optimize (L2O) method that employs LSTM-based networks to learn optimal step sizes by leveraging insights from past optimization trajectories, while inherently respecting the derived theoretical guarantees. Extensive experiments demonstrate that our approach achieves substantial improvements over scalar step size methods and hypergradient descent-based methods, offering up to 4$\times$ faster convergence across diverse optimization tasks.

Paper Structure

This paper contains 32 sections, 3 theorems, 49 equations, 4 figures, 1 table.

Key Result

Theorem 1

Let $\{x_k\}$ be the sequence generated by the BFGS update rule (eq:bfgs_update). If the coordinate-wise step size $P_k$ satisfies the theorem's condition (stated in terms of $\alpha$, $\beta$, $L$, and $B_k$) for some $0<\alpha<2$ and $\beta>0$, where $L$ is the Lipschitz constant of the gradient $\nabla f$ and $B_k$ is the approximate Hessian generated by BFGS, then the gradient norms converge to zero: $\lim_{k\to\infty}\|\nabla f(x_k)\|_2=0$.
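
To make the setting of Theorem 1 concrete, here is a minimal sketch of a BFGS iteration with a diagonal step-size matrix $P_k = \mathrm{diag}(p_k)$. It rests on several assumptions that are not taken from the paper: the update is assumed to have the form $x_{k+1} = x_k - P_k B_k^{-1}\nabla f(x_k)$, the function name `bfgs_cwss` and the callback `step_fn` are hypothetical, and the `bounds` interval is a placeholder rather than the safe interval derived in the paper. In BFGS-L2O the step sizes would come from the LSTM; here any callable can stand in for it.

```python
import numpy as np

def bfgs_cwss(grad, x0, step_fn, bounds=(0.1, 1.0), iters=50, tol=1e-10):
    """BFGS with coordinate-wise step sizes (illustrative sketch only).

    step_fn(k, x, g, d) returns one step size per coordinate; it is a
    hypothetical stand-in for a learned predictor. `bounds` is a placeholder
    interval, NOT the interval from the paper's theory.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # inverse Hessian approximation B_k^{-1}
    g = grad(x)
    for k in range(iters):
        if np.linalg.norm(g) < tol:    # stop once the gradient is tiny
            break
        d = -H @ g                     # quasi-Newton direction
        p = np.clip(step_fn(k, x, g, d), bounds[0], bounds[1])
        s = p * d                      # step P_k d, scaled per coordinate
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = s @ y
        # Standard BFGS inverse update, skipped if curvature is not positive.
        if sy > 1e-12 * np.linalg.norm(s) * np.linalg.norm(y):
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Usage on a small least-squares problem f(x) = 0.5 * ||A x - b||^2.
A = np.array([[1.0, 0.0], [0.0, 0.9]])
b = np.array([1.0, 1.8])
grad = lambda x: A.T @ (A @ x - b)

x_scalar = bfgs_cwss(grad, np.zeros(2), lambda k, x, g, d: np.ones(2))
x_coord = bfgs_cwss(grad, np.zeros(2), lambda k, x, g, d: np.array([0.9, 1.0]))
print(x_scalar, x_coord)
```

Passing a constant vector of ones recovers the usual unit-step BFGS iteration, so the same routine covers both the scalar and the coordinate-wise case; clipping the predicted steps is one simple way to keep an arbitrary predictor inside a prescribed interval.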

Figures (4)

  • Figure 1: Least Squares.
  • Figure 2: Logistic Regression.
  • Figure 3: Simple CNN.
  • Figure 4: Log-sum-exp functions with different dimensions.

Theorems & Definitions (8)

  • Theorem 1
  • Remark
  • Theorem 2
  • Remark
  • Theorem 3
  • proof
  • proof
  • proof