Why Line Search when you can Plane Search? SO-Friendly Neural Networks allow Per-Iteration Optimization of Learning and Momentum Rates for Every Layer

Betty Shea; Mark Schmidt

TL;DR

The paper introduces subspace optimization (SO) as a practical enhancement to gradient-based methods by enabling per-iteration tuning of learning and momentum rates in a low-dimensional subspace, particularly for linear-composition problems (LCPs) and SO-friendly neural networks. It shows that line optimization (LO) and plane search (PS) can be implemented with the same asymptotic cost as fixed-step methods in many settings, enabling rapid, robust optimization across GD+M, quasi-Newton, and Adam variants. Through extensive experiments on logistic regression and two-layer networks, LO and SO consistently outperform traditional line searches and fixed-step strategies, often with per-layer rates further improving convergence when problem structure permits. The work also surveys the historical development of SO methods and discusses applicability, limitations, and future directions for integrating SO with SGD, deep learning, and specialized problem classes like matrix-factorization and log-determinant problems. Overall, the results suggest that, for suitable problem structures, LO and SO provide fast, hyper-parameter-insensitive training with practical computational costs.

Abstract

We introduce the class of SO-friendly neural networks, which includes several models used in practice, including networks with 2 layers of hidden weights where the number of inputs is larger than the number of outputs. SO-friendly networks have the property that performing a precise line search to set the step size on each iteration has the same asymptotic cost during full-batch training as using a fixed learning rate. Further, for the same cost a plane search can be used to set both the learning and momentum rate on each step. Even further, SO-friendly networks also allow us to use subspace optimization to set a learning rate and momentum rate for each layer on each iteration. We explore augmenting gradient descent as well as quasi-Newton methods and Adam with line optimization and subspace optimization, and our experiments indicate that this gives fast and reliable ways to train these networks that are insensitive to hyper-parameters.
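To make the cost claim concrete, here is a minimal sketch of line optimization for a linear-composition problem, assuming a logistic regression objective of the form f(w) = g(Xw). The idea illustrated is that once the products Xw and Xd are cached, every trial step size costs only O(n) instead of a fresh matrix-vector product. The helper names (`matvec`, `loss_from_margins`, `gradient`, `line_optimize`), the toy data, and the choice of golden-section search as the one-dimensional solver are assumptions made for illustration, not the authors' implementation.

```python
import math

def matvec(X, v):
    """Dense product X @ v; this is the expensive O(n * d) operation."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def loss_from_margins(z, y):
    """Logistic loss sum_i log(1 + exp(-y_i * z_i)) given z = X @ w; O(n)."""
    total = 0.0
    for z_i, y_i in zip(z, y):
        m = -y_i * z_i
        total += max(m, 0.0) + math.log1p(math.exp(-abs(m)))  # stable log(1 + e^m)
    return total

def gradient(X, z, y):
    """Gradient X^T r with r_i = -y_i / (1 + exp(y_i * z_i))."""
    r = [-y_i / (1.0 + math.exp(y_i * z_i)) for z_i, y_i in zip(z, y)]
    return [sum(row[j] * r_i for row, r_i in zip(X, r)) for j in range(len(X[0]))]

def line_optimize(Xw, Xd, y, tol=1e-8, max_doublings=60, max_iters=200):
    """Approximately minimize phi(alpha) = f(w + alpha * d) over alpha >= 0.

    Only the cached vectors Xw and Xd are used, so each phi evaluation is O(n).
    Golden-section search is an illustrative choice of 1-D solver.
    """
    def phi(alpha):
        return loss_from_margins([p + alpha * q for p, q in zip(Xw, Xd)], y)

    # Grow the bracket [0, hi] until phi stops decreasing (phi is convex here).
    hi = 1.0
    for _ in range(max_doublings):
        if phi(hi) >= phi(hi / 2):
            break
        hi *= 2.0

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 0.0, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = phi(c), phi(d)
    for _ in range(max_iters):
        if b - a <= tol:
            break
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = phi(c)
        else:
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = phi(d)
    return (a + b) / 2.0

# Toy, non-separable data (hypothetical values chosen for the example).
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 1.0], [0.5, -1.0], [1.5, 1.0]]
y = [1, 1, -1, -1, 1, -1]
w = [0.0, 0.0]

Xw = matvec(X, w)                       # one O(n * d) product
direction = [-g for g in gradient(X, Xw, y)]
Xd = matvec(X, direction)               # one more O(n * d) product
alpha = line_optimize(Xw, Xd, y)        # every trial step inside is O(n)

w_new = [w_j + alpha * d_j for w_j, d_j in zip(w, direction)]
Xw_new = [p + alpha * q for p, q in zip(Xw, Xd)]  # reusable at the next iteration
```

In this sketch the updated product `Xw_new` is obtained from the cached vectors, so the next iteration does not need to recompute X @ w. Under the same assumption, a plane search over a learning rate and a momentum rate would cache one additional product for the momentum direction and minimize over two scalars rather than one.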

Paper Structure (25 sections, 27 equations, 22 figures)

Should we use Subspace Optimization in Machine Learning?
List of Contributions by Figure
Comments on Limitations of the Applicability of Subspace Optimization
Line Search and Plane Search for Linear Composition Problems (LCPs)
Efficient Line Search (LS) and Line Optimization (LO)
Efficient Plane Search (PS) for the Memory Gradient Method
LO and SO for LCPs in Practice
The Scattered 50+ Year History of Subspace Optimization Methods
Line Search and [Hyper-]Plane Search for Neural Networks
Definition: SO-Friendly Neural Networks
Example: 2-Layer Networks with a Single Output - Tied Step Size(s)
LO and SO for 2-Layer Networks in Practice (Tied Step Sizes)
Example: 2-Layer Networks with a Single Output - Per-Layer Step Sizes
LO and SO for 2-Layer Networks in Practice (Per-Layer Step Sizes)
Other Examples of SO-Friendly Networks
...and 10 more sections

Figures (22)

Figure 1: Step sizes allowed by the Armijo condition (in blue) and the update that minimizes the function, as found by line optimization (in red). For a fixed $\sigma$, the improvement of line optimization over the step sizes allowed by the Armijo condition can be made arbitrarily large, even for convex functions.
Figure 2: Performance of different gradient-based methods for fitting logistic regression models. Each plot is a different dataset. The black line only backtracks, the blue lines use a line search that can decrease or increase the step size to satisfy the strong Wolfe conditions, the orange lines use LO, and the magenta line uses SO. The GD methods use the gradient direction, the GD+M(L*) methods use the gradient direction and momentum with the non-linear conjugate gradient relationship between the parameters, and the GD+M(SO) method optimizes the learning rate and momentum rate. We see that LS methods tend to dominate the 1/L method, GD+M methods tend to dominate GD methods, LO methods tend to dominate LS methods, and the best performance on every dataset was achieved with SO.
Figure 3: Step sizes of different gradient descent methods for fitting logistic regression models. Note that the GD(LO) step sizes tended to lead to the fastest convergence while the GD(1/L) step sizes always converged slowest.
Figure 4: Step sizes of different gradient plus momentum methods for fitting logistic regression models. The solid lines are the learning rates and the dashed lines are the momentum rates. Note that the GD+M(SO) step sizes tended to lead to the fastest convergence while the GD+M(LS) step sizes usually converged slowest. A star is used to indicate the iteration where the momentum rate was negative.
Figure 5: Performance of different gradient-based methods for fitting logistic regression models. The black lines only backtrack, the orange line uses LO, the magenta line uses a two-dimensional SO, the light green line uses a three-dimensional SO, and the dark green line uses a four-dimensional SO. The NAG(1/L) is an implementation of Nesterov's accelerated gradient method. The NAG(SO) adds "gradient momentum" to the memory gradient method GD+M(SO), while the SNAG(SO) method further adds scaling of the parameter vector. We see that acceleration on its own tends to be less effective than using LO with an appropriate direction. We also see that adding additional directions improves performance but that only small gains tend to be observed by optimizing over more than 2 directions.
...and 17 more figures
