Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization
Aditya Ranganath, Mukesh Singhal, Roummel Marcia
TL;DR
This paper addresses the inefficiency of first-order methods in training nonconvex deep neural networks by integrating curvature information via indefinite Hessian approximations. It introduces ARCs-LSR1, which combines a limited-memory Symmetric Rank-One (L-SR1) Hessian update with Adaptive Regularization using Cubics (ARCs) and a shape-changing norm that yields a closed-form solution to the cubic subproblem. The authors establish convergence guarantees and demonstrate through extensive experiments in image classification, image reconstruction, and language modeling that ARCs-LSR1 often outperforms adaptive first-order methods and standard quasi-Newton approaches, with a stochastic variant offering scalable optimization. The work shows that exploiting negative curvature directions within a cubic-regularized framework can achieve faster convergence and higher accuracy in deep learning, providing a practical curvature-aware alternative for training large-scale networks.
Abstract
Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in deep learning because of their computational efficiency and low memory requirements. However, these methods do not exploit curvature information, so their iterates can converge to saddle points or poor local minima. Quasi-Newton methods, on the other hand, compute Hessian approximations that exploit this information with a comparable computational budget. They reuse previously computed iterates and gradients to compute a low-rank structured update. The most widely used quasi-Newton update is L-BFGS, which guarantees a positive semi-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions in deep neural networks (DNNs) are non-convex, so the Hessian may not be positive definite. In this paper, we propose a limited-memory symmetric rank-one (L-SR1) quasi-Newton approach, which allows for indefinite Hessian approximations and thus enables directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularization using cubics approach, which generates a sequence of cubic subproblems that have closed-form solutions with suitable regularization choices. We investigate the performance of our proposed method on autoencoders and feed-forward neural network models and compare our approach to state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.
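
To make the claim about indefinite Hessian approximations concrete, the following is a minimal NumPy sketch of the standard dense SR1 update, B_new = B + r r^T / (r^T s) with r = y - B s, applied to a toy quadratic with one direction of negative curvature. The function name `sr1_update`, the skip tolerance `tol`, and the toy problem are assumptions made for illustration. It is the full-memory textbook update, not the limited-memory formulation or the implementation used in the paper.

```python
import numpy as np


def sr1_update(B, s, y, tol=1e-8):
    """Dense SR1 update (illustrative sketch, not the paper's code).

    B : current symmetric Hessian approximation
    s : step, x_new - x_old
    y : gradient difference, grad_new - grad_old
    """
    r = y - B @ s          # residual of the secant equation B s = y
    denom = r @ s
    # Common safeguard: skip the update when the denominator is tiny.
    if abs(denom) <= tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom


# Toy problem (assumed): f(x) = 0.5 * x^T H x with an indefinite Hessian H.
H = np.diag([2.0, -1.0])
B = np.eye(2)              # positive definite initial approximation

steps = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for s in steps:
    y = H @ s              # exact gradient difference for a quadratic
    B = sr1_update(B, s, y)

eigs = np.linalg.eigvalsh(B)   # eigenvalues in ascending order
print("SR1 approximation:\n", B)
print("eigenvalues:", eigs)
```

In this toy case the approximation picks up the negative eigenvalue of H after two updates, even though it started from the identity. A BFGS-type update would skip or damp the second pair, since there s^T y < 0 violates its curvature condition.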
