Gradient descent with generalized Newton's method

Zhiqi Bu, Shiyun Xu

TL;DR

GeN introduces a Hessian-informed, automatic learning-rate mechanism that can be attached to any base optimizer, deriving the optimal step size from a second-order Taylor expansion. By performing a small number of forward passes, GeN fits a local quadratic to estimate the coefficients that govern the optimal learning rate, avoiding explicit Hessian construction and any additional back-propagation. The method recovers Newton's method as a special case and is designed to be computation- and memory-efficient through lazy updates and activation-free forward passes. Extensive experiments across image classification, natural language processing, object detection, and generation demonstrate that GeN matches or surpasses state-of-the-art LR schedulers with minimal tuning, while remaining scalable to large models and distributed training.

Abstract

We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates convergence, without intensive tuning of a learning rate scheduler. In practice, our method is easy to implement, since it only requires additional forward passes, with almost zero computational overhead (in terms of training time and memory cost) when that overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g., GPT and ResNet) to show that GeN optimizers match the state-of-the-art performance previously achieved with carefully tuned learning rate schedulers.
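The forward-pass-only estimate described above can be illustrated with a minimal sketch. It assumes the loss along the update direction is well approximated by a parabola in the learning rate, and it recovers the two coefficients from three loss evaluations by central finite differences. The function name `estimate_optimal_lr`, the probe step `eta0`, the `None` fallback for non-positive curvature, and the toy quadratic loss are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def estimate_optimal_lr(loss_fn, w, direction, eta0=1e-2):
    """Fit loss_fn(w - eta * direction) ~ f0 - eta * b + 0.5 * eta**2 * a
    from three forward passes and return the minimizer b / a.

    Hypothetical helper: returns None when the fitted curvature is not
    positive, since the parabola then has no minimum (an assumed fallback).
    """
    f_plus = loss_fn(w - eta0 * direction)   # forward pass at +eta0
    f_zero = loss_fn(w)                      # forward pass at 0
    f_minus = loss_fn(w + eta0 * direction)  # forward pass at -eta0

    b = (f_minus - f_plus) / (2.0 * eta0)              # estimates G^T g
    a = (f_plus + f_minus - 2.0 * f_zero) / eta0 ** 2  # estimates g^T H g
    if a <= 0:
        return None
    return b / a


# Toy check on a quadratic loss, where the parabola fit is exact up to rounding.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5.0 * np.eye(5)  # symmetric positive definite Hessian
c = rng.normal(size=5)


def quad_loss(w):
    return 0.5 * w @ A @ w - c @ w


w = rng.normal(size=5)
grad = A @ w - c  # exact gradient, used here as the update direction (plain SGD)

eta_fit = estimate_optimal_lr(quad_loss, w, grad)
eta_exact = (grad @ grad) / (grad @ A @ grad)  # closed form G^T g / g^T H g
```

In this toy setting `eta_fit` and `eta_exact` agree closely, and neither the Hessian `A` nor any extra backward pass is needed to compute `eta_fit`. For a real network the loss is only locally quadratic and the forward passes are on minibatches, so the estimate would be noisier than in this sketch.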

Paper Structure

This paper contains 56 sections, 3 theorems, 35 equations, 15 figures, 8 tables, 2 algorithms.

Key Result

Proposition 3.3

If $\mathbf{g}_t^\text{optim}=\mathbf{H}_t^{-1}\mathbf{G}_t$, then the update in (eq:autosgd) reduces to Newton's method, since $\eta_t^*\mathbf{g}_t^\text{optim}=\mathbf{H}_t^{-1}\mathbf{G}_t$.
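The reduction can be checked in a few lines. The derivation below is a sketch under assumptions not shown in this excerpt: that $\eta_t^*$ is the minimizer of the second-order Taylor expansion of the loss along $\mathbf{g}_t^\text{optim}$, that $\mathbf{H}_t$ is symmetric and invertible, and that $\mathbf{G}_t^\top\mathbf{H}_t^{-1}\mathbf{G}_t > 0$. The symbols $L$ (loss) and $\mathbf{w}_t$ (parameters) are introduced here for illustration.

```latex
\begin{aligned}
L(\mathbf{w}_t - \eta\,\mathbf{g}_t^\text{optim})
  &\approx L(\mathbf{w}_t)
     - \eta\,\mathbf{G}_t^\top \mathbf{g}_t^\text{optim}
     + \frac{\eta^2}{2}\,(\mathbf{g}_t^\text{optim})^\top \mathbf{H}_t\,\mathbf{g}_t^\text{optim}, \\
\eta_t^*
  &= \frac{\mathbf{G}_t^\top \mathbf{g}_t^\text{optim}}
          {(\mathbf{g}_t^\text{optim})^\top \mathbf{H}_t\,\mathbf{g}_t^\text{optim}}, \\
\mathbf{g}_t^\text{optim} = \mathbf{H}_t^{-1}\mathbf{G}_t
  \;&\Longrightarrow\;
  \eta_t^*
  = \frac{\mathbf{G}_t^\top \mathbf{H}_t^{-1}\mathbf{G}_t}
         {\mathbf{G}_t^\top \mathbf{H}_t^{-1}\mathbf{H}_t\mathbf{H}_t^{-1}\mathbf{G}_t}
  = 1 .
\end{aligned}
```

Under these assumptions the optimal learning rate is exactly 1, so the step $\eta_t^*\mathbf{g}_t^\text{optim}$ is the Newton step $\mathbf{H}_t^{-1}\mathbf{G}_t$.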

Figures (15)

  • Figure 1: Effects of various learning rate schedulers. Left two: ResNet18 on CIFAR100 dataset, compared with constant learning rates. Right two: GPT2 on E2E dataset, compared with heuristic learning rate schedulers.
  • Figure 2: Illustration of the second-order Taylor expansion in (eq:lr parabola). Left two: ResNet18 on CIFAR100 with SGD. Right two: GPT2 on E2E with AdamW.
  • Figure 3: Convergence of ResNet18 on CIFAR10, optimized by GeN-SGD with various batch sizes.
  • Figure 4: Convergence of GeN-SGD (upper panel) and GeN-AdamW (lower panel) on CIFAR10 with various model architectures and sizes.
  • Figure 5: Convergence of ResNet18 on CIFAR10, optimized by GeN-SGD with various $\Phi$.
  • ...and 10 more figures

Theorems & Definitions (11)

  • Remark 3.1
  • Remark 3.2
  • Proposition 3.3
  • Remark 3.4
  • Remark 3.5
  • Proposition 3.6
  • Definition A.1
  • Proof
  • Proof of (prop:opb error)
  • Theorem 1
  • ...and 1 more