The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani

TL;DR

This study investigates the practical limits of second-order optimization for large language models by applying full Gauss-Newton preconditioning to transformers of up to 150M parameters. By comparing full GN, GN-prox-linear, and layerwise GN, the authors demonstrate substantial reductions in training iterations (up to $5.4\times$ fewer than SOAP) and improved batch-size scaling, with layerwise GN capturing most of the gains without using cross-layer curvature. Using memory-efficient Jacobian-vector products, the work provides empirical evidence that higher-order terms beyond GN are not strictly necessary for convergence speed, and that layerwise curvature information is sufficient for most of the gains. The results offer a target for future, more practical second-order methods and suggest that better layerwise Hessian approximations could yield major efficiency improvements in LLM training. Overall, the paper frames a concrete optimization-performance frontier for second-order methods in large transformers and highlights potential pathways to practical, scalable preconditioning.

Abstract

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
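The TL;DR mentions memory-efficient Jacobian-vector products as the way Gauss-Newton curvature is accessed. The following is a minimal JAX sketch of that general idea, not the authors' implementation: it computes the product $(J^\top H J)v$ without materializing $J$ or $H$. The one-layer `model`, the cross-entropy `loss_from_logits`, and all shapes are hypothetical stand-ins chosen only to keep the example small.

```python
import jax
import jax.numpy as jnp

def model(params, x):
    # Hypothetical tiny model (assumption): a single linear layer producing logits.
    W, b = params
    return x @ W + b

def loss_from_logits(logits, labels):
    # Mean softmax cross-entropy; convex as a function of the logits.
    logp = jax.nn.log_softmax(logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(logp, labels[:, None], axis=-1))

def gn_vector_product(params, x, labels, v):
    """Return (J^T H J) v, where J is the Jacobian of the logits w.r.t. the
    parameters and H is the Hessian of the loss w.r.t. the logits."""
    f = lambda p: model(p, x)
    # Forward mode: logits and J v in one pass.
    logits, Jv = jax.jvp(f, (params,), (v,))
    # H (J v): differentiate the logit-gradient along the direction J v.
    grad_wrt_logits = lambda z: jax.grad(loss_from_logits)(z, labels)
    _, HJv = jax.jvp(grad_wrt_logits, (logits,), (Jv,))
    # Reverse mode: J^T (H J v).
    _, vjp_fn = jax.vjp(f, params)
    (JtHJv,) = vjp_fn(HJv)
    return JtHJv

# Illustrative usage on random data (all sizes are arbitrary).
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = (jax.random.normal(k1, (4, 3)), jnp.zeros(3))
x = jax.random.normal(k2, (8, 4))
labels = jax.random.randint(k3, (8,), 0, 3)
v = (jax.random.normal(k4, (4, 3)), jnp.ones(3))
Gv = gn_vector_product(params, x, labels, v)
```

Because the product is matrix-free, memory use stays on the order of a few forward and backward passes rather than the size of the full curvature matrix. For this toy model, which is linear in its parameters, the Gauss-Newton matrix coincides with the exact Hessian; for a transformer the two differ, and that difference is the higher-order information the abstract refers to.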

Paper Structure

This paper contains 38 sections, 21 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Validation loss versus training step up to a target loss of 3.25, with each method run beyond its critical batch size. Gauss-Newton and layerwise Gauss-Newton reach the target loss in 54 and 78 steps respectively, compared to 292 steps for SOAP (roughly 5.4x and 3.7x fewer steps).
  • Figure 2: Left: Batch size vs final validation loss for models trained for a Chinchilla-optimal number of tokens. The dotted line marks the loss achieved by a model trained with Muon with batch size 128k. This represents the upper bound of performance for our Gauss-Newton method. Right: Critical batch size scaling. The dotted line marks the optimal scaling trend, where no sample efficiency is lost as batch size increases.
  • Figure 3: Left: Comparison of full Gauss-Newton with its layerwise variant for 150M-parameter models trained on a Chinchilla-optimal token count. The layerwise method nearly matches the performance of full Gauss-Newton. Right: The Gauss-Newton update closely matches the GN-prox-linear method, which has access to higher-order loss terms.
  • Figure 4: Three learning rate schedules used for the Gauss-Newton and GN-prox-linear runs. From left to right: global cosine, global+inner cosine, and constant+inner cosine. Each inner cosine period lasts the duration of the optimization over the current Taylor expansion; outer step refers to each parameter update on the model.
  • Figure 5: Step sizes selected by line search for Gauss-Newton and layerwise Gauss-Newton.
  • ...and 1 more figure
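
The Figure 4 caption names three learning rate schedules built from a global component and a per-outer-step inner cosine. The sketch below is one possible reading of that caption, not the paper's definition: the function names, the `peak` parameter, the decay-to-zero endpoint, and the choice to multiply the global and inner factors are all assumptions made for illustration.

```python
import math

def cosine_decay(progress):
    """Cosine multiplier: 1.0 at progress=0, 0.0 at progress=1."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def learning_rate(schedule, outer, inner, n_outer, n_inner, peak=1.0):
    """Learning rate at inner step `inner` of outer step `outer`.

    Assumed setup: each outer step (one parameter update on the model) runs
    n_inner inner steps over the current Taylor expansion.
    """
    overall = (outer * n_inner + inner) / (n_outer * n_inner)
    within = inner / n_inner
    if schedule == "global_cosine":
        # A single cosine spanning the whole run.
        return peak * cosine_decay(overall)
    if schedule == "global_plus_inner_cosine":
        # Global cosine envelope times a cosine that restarts every outer step.
        return peak * cosine_decay(outer / n_outer) * cosine_decay(within)
    if schedule == "constant_plus_inner_cosine":
        # Constant envelope; only the inner cosine varies.
        return peak * cosine_decay(within)
    raise ValueError(f"unknown schedule: {schedule}")

# Illustrative values: 10 outer steps of 5 inner steps each.
rates = [learning_rate("constant_plus_inner_cosine", 0, i, 10, 5) for i in range(5)]
```

Under this reading, the two "inner" variants reset to their envelope value at the start of every outer step, while the global cosine never restarts.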