Why Line Search when you can Plane Search? SO-Friendly Neural Networks allow Per-Iteration Optimization of Learning and Momentum Rates for Every Layer

Betty Shea, Mark Schmidt

TL;DR

The paper introduces subspace optimization (SO) as a practical enhancement to gradient-based methods by enabling per-iteration tuning of learning and momentum rates in a low-dimensional subspace, particularly for linear-composition problems (LCPs) and SO-friendly neural networks. It shows that line optimization (LO) and plane search (PS) can be implemented with the same asymptotic cost as fixed-step methods in many settings, enabling rapid, robust optimization across gradient descent with momentum (GD+M), quasi-Newton, and Adam variants. Through extensive experiments on logistic regression and two-layer networks, LO and SO consistently outperform traditional line searches and fixed-step strategies, often with per-layer rates further improving convergence when problem structure permits. The work also surveys the historical development of SO methods and discusses applicability, limitations, and future directions for integrating SO with SGD, deep learning, and specialized problem classes like matrix-factorization and log-determinant problems. Overall, the results suggest that, for suitable problem structures, LO and SO provide fast, hyper-parameter-insensitive training with practical computational costs.

Abstract

We introduce the class of SO-friendly neural networks, which include several models used in practice, including networks with 2 layers of hidden weights where the number of inputs is larger than the number of outputs. SO-friendly networks have the property that performing a precise line search to set the step size on each iteration has the same asymptotic cost during full-batch training as using a fixed learning rate. Further, for the same cost a plane search can be used to set both the learning and momentum rate on each step. Even further, SO-friendly networks also allow us to use subspace optimization to set a learning rate and momentum rate for each layer on each iteration. We explore augmenting gradient descent as well as quasi-Newton methods and Adam with line optimization and subspace optimization, and our experiments indicate that this gives fast and reliable ways to train these networks that are insensitive to hyper-parameters.
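
To make the cost claim concrete, here is a minimal sketch of a plane search for logistic regression, a linear-composition problem of the form $f(w) = g(Xw)$. It is not the authors' implementation: the function names, the synthetic data, and the damped Newton solver for the two-dimensional subproblem are all assumptions chosen for illustration. The point it shows is that once the products $Xw$, $X\nabla f(w)$, and $Xm$ are available, each evaluation inside the search over the learning rate $\alpha$ and momentum rate $\beta$ costs $O(n)$, so the iteration still needs only two matrix-vector products with $X$.

```python
import numpy as np

def loss(z, y):
    """Logistic loss g(z) = sum_i log(1 + exp(-y_i z_i)), labels in {-1, +1}."""
    return np.sum(np.logaddexp(0.0, -y * z))

def dloss(z, y):
    """Elementwise first derivative of g with respect to z."""
    return -y * 0.5 * (1.0 - np.tanh(0.5 * y * z))

def d2loss(z, y):
    """Elementwise second derivative of g with respect to z."""
    p = 0.5 * (1.0 + np.tanh(0.5 * y * z))
    return p * (1.0 - p)

def plane_search_step(X, y, w, m, z, Xm, newton_iters=20):
    """One GD+M step, w_new = w - alpha * grad + beta * m, with (alpha, beta)
    chosen by approximately minimizing the loss over that plane.
    z = X @ w and Xm = X @ m are tracked across iterations (hypothetical design)."""
    g = X.T @ dloss(z, y)            # gradient in w: first O(n d) product
    Xg = X @ g                       # second O(n d) product
    P = np.column_stack([-Xg, Xm])   # n x 2: the plane as seen through X
    t = np.zeros(2)                  # t = (alpha, beta)
    f_t = loss(z, y)
    for _ in range(newton_iters):    # every operation below is O(n)
        zt = z + P @ t
        grad_t = P.T @ dloss(zt, y)
        H = P.T @ (P * d2loss(zt, y)[:, None]) + 1e-8 * np.eye(2)
        d = np.linalg.solve(H, grad_t)
        step, improved = 1.0, False
        while step > 1e-10:          # halve the Newton step until the loss drops
            t_try = t - step * d
            f_try = loss(z + P @ t_try, y)
            if f_try < f_t:
                t, f_t, improved = t_try, f_try, True
                break
            step *= 0.5
        if not improved:
            break
    alpha, beta = t
    m_new = -alpha * g + beta * m
    # X @ m_new equals P @ t, so both tracked products update in O(n).
    return w + m_new, m_new, z + P @ t, P @ t, f_t

# Synthetic, noisy (non-separable) data purely for demonstration.
rng = np.random.default_rng(0)
n, dim = 200, 10
X = rng.standard_normal((n, dim))
scores = X @ rng.standard_normal(dim) + rng.standard_normal(n)
y = np.where(scores >= 0, 1.0, -1.0)

w, m = np.zeros(dim), np.zeros(dim)
z, Xm = X @ w, X @ m
losses = [loss(z, y)]
for _ in range(15):
    w, m, z, Xm, f = plane_search_step(X, y, w, m, z, Xm)
    losses.append(f)
print(losses[0], losses[-1])
```

Because $(\alpha, \beta) = (0, 0)$ lies in the plane and a trial point is accepted only when it lowers the loss, the recorded loss never increases in this sketch. Dropping the second column of `P` reduces it to a line optimization over the learning rate alone.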

Paper Structure (25 sections, 27 equations, 22 figures)

Figures (22)

  • Figure 1: Valid step sizes allowed by Armijo condition (in blue) and update minimizing function (in red) by line optimization. For a fixed $\sigma$, the improvement of line optimization over the step sizes allowed by the Armijo condition can be made arbitrarily large even for convex functions.
  • Figure 2: Performance of different gradient-based methods for fitting logistic regression models. Each plot is a different dataset. The black line only backtracks, the blue lines use a line search that can decrease or increase the step size to satisfy the strong Wolfe conditions, the orange lines use LO, and the magenta line uses SO. The GD methods use the gradient direction, the GD+M(L*) methods use the gradient direction and momentum with the non-linear conjugate gradient relationship between the parameters, and the GD+M(SO) method optimizes the learning rate and momentum rate. We see that LS methods tend to dominate the 1/L method, GD+M methods tend to dominate GD methods, LO methods tend to dominate LS methods, and the best performance on every dataset was achieved with SO.
  • Figure 3: Step sizes of different gradient descent methods for fitting logistic regression models. Note that the GD(LO) step sizes tended to lead to the fastest convergence while the GD(1/L) step sizes always converged slowest.
  • Figure 4: Step sizes of different gradient plus momentum methods for fitting logistic regression models. The solid lines are the learning rates and the dashed lines are the momentum rates. Note that the GD+M(SO) step sizes tended to lead to the fastest convergence while the GD+M(LS) step sizes usually converged slowest. A star is used to indicate the iteration where the momentum rate was negative.
  • Figure 5: Performance of different gradient-based methods for fitting logistic regression models. The black lines only backtrack, the orange line uses LO, the magenta line uses a two-dimensional SO, the light green line uses a three-dimensional SO, and the dark green line uses a four-dimensional SO. The NAG(1/L) is an implementation of Nesterov's accelerated gradient method. The NAG(SO) adds "gradient momentum" to the memory gradient method GD+M(SO), while the SNAG(SO) method further adds scaling of the parameter vector. We see that acceleration on its own tends to be less effective than using LO with an appropriate direction. We also see that adding additional directions improves performance but that only small gains tend to be observed by optimizing over more than 2 directions.
  • ...and 17 more figures