Incremental Gauss-Newton Descent for Machine Learning

Mikalai Korbit, Mario Zanon

TL;DR

This work introduces Incremental Gauss-Newton Descent (IGND), an incremental optimization method that injects approximate second-order information into SGD-like updates without incurring the full cost of Hessian calculations. The authors derive a Gauss-Newton (GN) approximation of the Hessian and show that, for a single-sample update, the GN step reduces to a scalar factor ξ that multiplies a standard gradient-like direction, so IGND achieves scale-aware updates at essentially the computational cost of SGD. They prove convergence under standard assumptions and demonstrate through supervised learning and reinforcement learning (RL) experiments that IGND converges at least as fast as SGD and often significantly faster, while requiring considerably less hyperparameter tuning. IGND is versatile: it applies to value-based RL (IGNDQ) and to policy evaluation for the linear quadratic regulator (LQR), and it can be accelerated with existing first-order methods. The practical impact is a robust, scalable optimizer that exploits curvature information at low computational cost, improving training dynamics across domains.
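
To illustrate why a single-sample GN step can collapse to a scalar, the following is a minimal worked derivation. It assumes a scalar model output $f(\theta)$, a squared-error loss, and a Levenberg-Marquardt-style regularizer $\epsilon > 0$; the symbols $r$, $g$, $d$ and $\epsilon$ are introduced here for illustration and may not match the paper's notation.

```latex
% Assumed setting: one sample, scalar output f(\theta), target y
r(\theta) = f(\theta) - y, \qquad
g = \nabla_\theta f(\theta), \qquad
\ell(\theta) = \tfrac{1}{2}\, r(\theta)^2 .

% Gradient of the loss and Gauss-Newton approximation of its Hessian
\nabla_\theta \ell = g\, r, \qquad
H_{\mathrm{GN}} = g\, g^\top .

% H_GN has rank one, so the step is regularized before solving
(g\, g^\top + \epsilon I)\, d = -\, g\, r
\;\Longrightarrow\;
d = -\,\xi\, g\, r,
\qquad
\xi = \frac{1}{g^\top g + \epsilon} .
```

Substituting $d = -\xi\, g\, r$ back into the linear system confirms the result: $(g g^\top + \epsilon I)\, g = (g^\top g + \epsilon)\, g$, so no matrix needs to be formed or inverted and the step costs one inner product beyond a plain gradient evaluation.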

Abstract

Stochastic Gradient Descent (SGD) is a popular technique used to solve problems arising in machine learning. While very effective, SGD also has some weaknesses and various modifications of the basic algorithm have been proposed in order to at least partially tackle them, mostly yielding accelerated versions of SGD. Filling a gap in the literature, we present a modification of the SGD algorithm exploiting approximate second-order information based on the Gauss-Newton approach. The new method, which we call Incremental Gauss-Newton Descent (IGND), has essentially the same computational burden as standard SGD, appears to converge faster on certain classes of problems, and can also be accelerated. The key intuition making it possible to implement IGND efficiently is that, in the incremental case, approximate second-order information can be condensed into a scalar value that acts as a scaling constant of the update. We derive IGND starting from the theory supporting Gauss-Newton methods in a general setting and then explain how IGND can also be interpreted as a well-scaled version of SGD, which makes tuning the algorithm simpler, and provides increased robustness. Finally, we show how IGND can be used in practice by solving supervised learning tasks as well as reinforcement learning problems. The simulations show that IGND can significantly outperform SGD while performing at least as well as SGD in the worst case.
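
The idea that the second-order information condenses into a scalar scaling of the update can be sketched in a few lines. The snippet below is an illustrative sketch rather than the authors' implementation: it assumes a linear model with scalar output, a squared-error loss, and the scaling `xi = 1 / (g @ g + eps)`; the names `ignd_step`, `fit_linear`, `alpha` and `eps` are hypothetical.

```python
import numpy as np

def ignd_step(theta, grad_f, residual, alpha=1.0, eps=1e-8):
    """One IGND-style update from a single sample (illustrative sketch).

    grad_f   : gradient of the scalar model output with respect to theta
    residual : model output minus target for this sample
    """
    xi = 1.0 / (grad_f @ grad_f + eps)  # scalar Gauss-Newton scaling (assumed form)
    return theta - alpha * xi * grad_f * residual

def fit_linear(X, y, epochs=10, alpha=1.0, eps=1e-8):
    """Fit f(theta; x) = theta @ x by sweeping over the samples one at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # For a linear model, the gradient of f with respect to theta is x_i.
            theta = ignd_step(theta, x_i, theta @ x_i - y_i, alpha, eps)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 10.0])  # deliberately unequal feature scales
theta_true = np.array([2.0, -3.0])
y = X @ theta_true                                     # noise-free targets
theta_hat = fit_linear(X, y)
print(np.linalg.norm(theta_hat - theta_true))          # distance to the true parameters
```

Because the update is divided by the squared gradient norm, the same `alpha` can be used regardless of how the features are scaled, whereas a plain SGD step on the same data would need a learning rate matched to the largest feature scale.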

Paper Structure

This paper contains 20 sections, 6 theorems, 81 equations, 4 figures, 4 tables, 4 algorithms.

Key Result

Lemma 3.1

The solution of the regularized GN step problem (eq:regularized_GN_step) is also optimal for the naive incremental GN step problem (eq:naive_GN_step_incremental).

Figures (4)

  • Figure 1: Learning curves on the test set for SGD, IGND, Adam and Adam applied to IGND-scaled gradients. The shaded area represents $\pm 1$ standard deviation around the mean (thick line) for 20 seeds.
  • Figure 2: Learning curves for QL and IGNDQ on FrozenLake-v1 environment with original (left plot) and scaled (right plot) features. The shaded area represents $\pm 1$ standard deviation around the mean return (thick line) for 20 seeds.
  • Figure 3: Learning curves for QL and IGNDQ on original Acrobot-v1 and CartPole-v1 and modified environments. The shaded area represents $\pm 1$ standard deviation around the mean return (thick line) for 20 seeds. In the left plots, the black dashed line shows the performance of the CleanRL DQN trained agent.
  • Figure 4: QL and IGNDQ learning curves against the optimal LQR controller for the deterministic and stochastic LQR formulations. The shaded area represents $\pm 1$ standard deviation around the mean return (thick line) for 20 seeds.

Theorems & Definitions (12)

  • Lemma 3.1
  • proof
  • Theorem 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Theorem 3.5: From Bottou et al. (2018)
  • proof
  • ...and 2 more