Gradient descent with generalized Newton's method
Zhiqi Bu, Shiyun Xu
TL;DR
GeN is a Hessian-informed, automatic learning-rate mechanism that can be attached to any base optimizer; it derives the optimal step size from a second-order Taylor expansion of the loss. Using a small number of extra forward passes, GeN fits a local quadratic to the loss along the update direction and estimates the coefficients that determine the optimal learning rate, without forming the Hessian explicitly or running additional back-propagation. The method recovers Newton's method as a special case and is designed to be computation- and memory-efficient through lazy updates and activation-free forward passes. Experiments across image classification, natural language processing, object detection, and generation show that GeN matches or surpasses state-of-the-art learning-rate schedulers with minimal tuning, while remaining scalable to large models and distributed training.
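The second-order expansion behind this summary can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the text above: w_t denotes the parameters, g_t the update direction produced by the base optimizer, G_t the gradient of the loss L at w_t, H_t the Hessian at w_t, and \eta the learning rate.

```latex
% Second-order Taylor expansion of the loss along the update direction g_t
L(w_t - \eta\, g_t) \approx L(w_t) - \eta\, G_t^\top g_t + \frac{\eta^2}{2}\, g_t^\top H_t\, g_t

% Minimizing the quadratic in \eta (valid when g_t^\top H_t g_t > 0)
\eta^\ast = \frac{G_t^\top g_t}{g_t^\top H_t\, g_t}

% Special case: with the Newton direction g_t = H_t^{-1} G_t and symmetric H_t,
\eta^\ast = \frac{G_t^\top H_t^{-1} G_t}{G_t^\top H_t^{-1} H_t H_t^{-1} G_t} = 1
```

Under these assumptions, only the two scalars G_t^\top g_t and g_t^\top H_t g_t are needed, which is why fitting a one-dimensional quadratic to a few loss values can stand in for building the Hessian.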
Abstract
We propose the generalized Newton's method (GeN), a Hessian-informed approach that applies to any optimizer, such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates convergence, without intensive tuning of the learning rate scheduler. In practice, our method is easy to implement, since it requires only additional forward passes, whose overhead in training time and memory cost is almost zero when amortized over many iterations. We present extensive experiments on language and vision tasks (e.g., GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance that was previously achieved with carefully tuned learning rate schedulers.
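To make the "additional forward passes" idea concrete, here is a minimal NumPy sketch on a toy quadratic loss. It is an illustration under stated assumptions, not the authors' implementation: the helper name `gen_learning_rate`, the probe step `eta`, the three-point finite-difference fit, and the fallback rule are all hypothetical choices made for this example.

```python
import numpy as np

def gen_learning_rate(loss_fn, w, g, eta):
    """Estimate a step size from three loss evaluations (forward passes only).

    Hypothetical helper: fits L(w - x*g) ~ L(w) - b*x + 0.5*a*x**2 through
    x in {-eta, 0, +eta} and returns b / a, the minimizer of that quadratic.
    """
    loss_minus = loss_fn(w + eta * g)  # x = -eta
    loss_zero = loss_fn(w)             # x = 0
    loss_plus = loss_fn(w - eta * g)   # x = +eta
    # Central finite differences along the direction g.
    b = (loss_minus - loss_plus) / (2.0 * eta)                  # ~ gradient . g
    a = (loss_plus - 2.0 * loss_zero + loss_minus) / eta ** 2   # ~ g . Hessian . g
    if a <= 0.0 or b <= 0.0:
        # Assumed fallback: keep the probe step when the fit is not a
        # convex quadratic with a descent direction.
        return eta
    return b / a

# Toy quadratic loss L(w) = 0.5 * w^T A w - c^T w (an assumption for this demo).
A = np.diag([1.0, 10.0])
c = np.array([1.0, 1.0])

def loss_fn(w):
    return 0.5 * w @ A @ w - c @ w

w = np.zeros(2)
G = A @ w - c  # exact gradient of the toy loss at w

# Gradient-descent direction: the fitted step size equals (G.G) / (G.A.G).
lr_sgd = gen_learning_rate(loss_fn, w, G, eta=0.1)

# Newton direction A^{-1} G: the fitted step size is 1.
newton_dir = np.linalg.solve(A, G)
lr_newton = gen_learning_rate(loss_fn, w, newton_dir, eta=0.1)
```

On an exactly quadratic loss the fit is exact up to floating-point error, so the Newton direction yields a step size of 1, consistent with the claim that Newton-Raphson is a sub-case. In a real training loop the loss is only locally quadratic, and one would presumably refresh the estimate every few iterations rather than at every step so that the cost of the extra forward passes is amortized.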
