Why Line Search when you can Plane Search? SO-Friendly Neural Networks allow Per-Iteration Optimization of Learning and Momentum Rates for Every Layer
Betty Shea, Mark Schmidt
TL;DR
The paper introduces subspace optimization (SO) as a practical enhancement to gradient-based methods by enabling per-iteration tuning of learning and momentum rates in a low-dimensional subspace, particularly for linear-composition problems (LCPs) and SO-friendly neural networks. It shows that line optimization (LO) and plane search (PS) can be implemented with the same asymptotic cost as fixed-step methods in many settings, enabling rapid, robust optimization across GD+M, quasi-Newton, and Adam variants. Through extensive experiments on logistic regression and two-layer networks, LO and SO consistently outperform traditional line searches and fixed-step strategies, often with per-layer rates further improving convergence when problem structure permits. The work also surveys the historical development of SO methods and discusses applicability, limitations, and future directions for integrating SO with SGD, deep learning, and specialized problem classes like matrix factorization and log-determinant problems. Overall, the results suggest that, for suitable problem structures, LO and SO provide fast, hyper-parameter-insensitive training with practical computational costs.
Abstract
We introduce the class of SO-friendly neural networks, which includes several models used in practice, such as networks with 2 layers of hidden weights where the number of inputs is larger than the number of outputs. SO-friendly networks have the property that performing a precise line search to set the step size on each iteration has the same asymptotic cost during full-batch training as using a fixed learning rate. Further, for the same cost a plane search can be used to set both the learning and momentum rate on each step. Even further, SO-friendly networks also allow us to use subspace optimization to set a learning rate and momentum rate for each layer on each iteration. We explore augmenting gradient descent as well as quasi-Newton methods and Adam with line optimization and subspace optimization, and our experiments indicate that this gives fast and reliable ways to train these networks that are insensitive to hyper-parameters.
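To make the cost claim concrete, here is a minimal sketch of line optimization on a linear-composition problem, using full-batch logistic regression as the example. It is an illustration under stated assumptions, not the authors' implementation: the function names, the golden-section routine used for the one-dimensional search, and the bracket `alpha_max` are all hypothetical choices. The idea it shows is that once the products `X @ w` and `X @ d` are cached, each trial step size costs only O(n), so a precise search does not change the O(np) per-iteration cost of a fixed-step gradient update.

```python
import numpy as np

def logistic_loss(z, y):
    # z holds the cached product X @ w; labels y are in {-1, +1}.
    return np.sum(np.logaddexp(0.0, -y * z))

def golden_section(phi, lo, hi, iters=60):
    # Minimize a unimodal 1-D function phi on [lo, hi].
    # (Assumed solver for illustration; any precise 1-D method would do.)
    r = (np.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = phi(c), phi(d)
    for _ in range(iters):
        if fc < fd:
            # Minimizer lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = phi(c)
        else:
            # Minimizer lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

def gd_with_line_optimization(X, y, iters=50, alpha_max=10.0):
    # alpha_max is an assumed upper bracket for the step size.
    n, p = X.shape
    w = np.zeros(p)
    z = X @ w                                  # cached product, O(np) once
    losses = [logistic_loss(z, y)]
    for _ in range(iters):
        # d(loss)/dz = -y * sigmoid(-y z), computed stably via logaddexp.
        dz = -y * np.exp(-np.logaddexp(0.0, y * z))
        d = -(X.T @ dz)                        # descent direction, O(np)
        Xd = X @ d                             # second matrix product, O(np)
        # Each evaluation of phi touches only length-n vectors: O(n).
        phi = lambda a: logistic_loss(z + a * Xd, y)
        alpha = golden_section(phi, 0.0, alpha_max)
        w += alpha * d
        z += alpha * Xd                        # update the cache, no new product
        losses.append(logistic_loss(z, y))
    return w, losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ rng.normal(size=5) + 0.5 * rng.normal(size=200))
    w, losses = gd_with_line_optimization(X, y, iters=20)
    print(losses[0], losses[-1])
```

A plane search would extend the same pattern: cache one additional product for the momentum direction and minimize over two scalars (learning rate and momentum rate) instead of one, still using only length-n vector operations per trial point.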
