Scalable Thermodynamic Second-order Optimization

Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles

TL;DR

This work tackles the high cost of second-order optimization in neural network training by adapting K-FAC to run on thermodynamic hardware. Using a block-diagonal Kronecker-factored approximation and a thermodynamic linear-algebra framework, the per-layer cost can be brought close to that of first-order methods while preserving the benefits of natural-gradient updates. The complexity analysis predicts speedups that grow linearly with the network width $n$, and quantization experiments indicate that these benefits survive low-precision operation. Estimated wall-clock results for a ViT on ImageNet and a GNN on ogbg-molpcba indicate speedups over standard K-FAC and the first-order baselines. Together, these results suggest a viable path to scalable, hardware-assisted second-order training for large-scale models in vision and graph domains.

Abstract

Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out-of-reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as $n$, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.

Paper Structure

This paper contains 26 sections, 26 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the thermodynamic algorithm for K-FAC. The left side shows a two-layer neural network with weight matrices $W_1$ and $W_2$ and activations $a_1, a_2, a_3$, all stored on a digital device. From these quantities, the Kronecker factors $A_\ell$ and $B_\ell$ are computed and sent to the thermodynamic solver, which either inverts them or solves a linear system in which they enter as the positive semi-definite matrix. The result is then sent back to the digital device and the weights are updated. Note that this algorithm is easily parallelized: many thermodynamic solvers can be used, each computing the K-FAC update rule (Eq. \ref{eq:K-FAC_update}) for one or more layers.
  • Figure 2: Profiling of the K-FAC update for different architectures. Panel (a): K-FAC update time contributions for an MLP with a fixed depth of 50 and a varying number of neurons $n$ in each layer. Panel (b): K-FAC update time contributions for a GPT architecture (based on Ref. Karpathy2022) with varying embedding dimension, which is the number of neurons $n$ in the linear layers. Panel (c): GPT architecture with varying vocabulary size and a fixed embedding dimension. For all plots, the reported times are averaged over 10 repetitions and measured on an Nvidia A100 GPU.
  • Figure 3: Results on ImageNet and OGBG. Panels (a-b): validation loss and validation accuracy for the NAdamW (the baseline given by AlgoPerf), K-FAC and Thermodynamic K-FAC (estimated) optimizers as a function of wall-clock time for training a ViT on ImageNet. Panels (c-d): validation loss and validation mean-average precision (mAP) for the Nesterov (baseline), K-FAC and Thermodynamic K-FAC (estimated) optimizers as a function of wall-clock time for training a GNN on ogbg-molpcba. For the baselines, the hyperparameters are taken directly from the AlgoPerf benchmark; for the K-FAC optimizers, they were tuned (see Appendix \ref{sec:code}).
  • Figure 4: Effect of quantization on K-FAC training accuracy. Validation accuracy from training a ResNet on image classification with either the Adam optimizer or with the K-FAC optimizer for various levels of precision (integer 6, 8, 12 and 16 bits, and floating-point 32 bits at full precision). The left panel shows the effects of input quantization while the right panel shows the effect of output quantization. The lines correspond to the mean values over 5 runs, while the shaded areas represent one standard deviation away from the mean. Brighter colours indicate higher precision.
  • Figure 5: Circuit diagram of a possible implementation of the thermodynamic solver.
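
The per-layer pipeline described in the Figure 1 caption (form the Kronecker factors, solve a linear system with them, return the result to the digital device) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `kfac_layer_update`, the argument names, the Tikhonov `damping` term, and the batch-averaged definition of the factors are all assumptions made for illustration, and `np.linalg.solve` stands in for the thermodynamic solver.

```python
import numpy as np

def kfac_layer_update(grad_w, acts, back_grads, damping=1e-3):
    """Precondition one layer's gradient with damped Kronecker factors.

    grad_w:     (n_out, n_in) gradient of the loss w.r.t. the weights
    acts:       (batch, n_in) layer inputs
    back_grads: (batch, n_out) gradients w.r.t. the layer's pre-activations
    All names and the damping scheme are illustrative assumptions.
    """
    batch = acts.shape[0]
    # Kronecker factors (assumed batch-averaged second moments).
    a_factor = acts.T @ acts / batch              # (n_in, n_in)
    b_factor = back_grads.T @ back_grads / batch  # (n_out, n_out)
    # Damping keeps both positive semi-definite factors invertible.
    a_damped = a_factor + damping * np.eye(a_factor.shape[0])
    b_damped = b_factor + damping * np.eye(b_factor.shape[0])
    # Two small solves replace one solve with the full Kronecker product.
    # A digital solver is used here in place of the thermodynamic one.
    x = np.linalg.solve(b_damped, grad_w)    # B^{-1} G
    y = np.linalg.solve(a_damped, x.T).T     # B^{-1} G A^{-1} (A symmetric)
    return y

# Toy usage with random data (hypothetical sizes).
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 5))
back_grads = rng.normal(size=(32, 3))
grad_w = back_grads.T @ acts / acts.shape[0]
update = kfac_layer_update(grad_w, acts, back_grads)
print(update.shape)  # (3, 5)
```

The two solves involve only the $n_\text{in} \times n_\text{in}$ and $n_\text{out} \times n_\text{out}$ factors, which is why each layer can be handed to a separate solver, as the caption notes.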