Scalable Thermodynamic Second-order Optimization
Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles
TL;DR
This work addresses the high cost of second-order optimization in neural network training by adapting K-FAC to run on thermodynamic hardware. Using a block-diagonal Kronecker-factored curvature approximation and a thermodynamic linear-algebra framework, it reduces per-layer computation toward the complexity of first-order methods while preserving the benefits of natural-gradient updates. The complexity analysis predicts a speedup that grows linearly with network width $n$, and numerical experiments indicate that the method tolerates significant quantization noise, which suggests that low-precision deployment is practical. For ViT on ImageNet and a GNN on OGBG-molpcba, the authors predict speedups over standard K-FAC and Adam based on realistic hardware characteristics. Together, these results point to a viable path toward scalable, hardware-assisted second-order training for large-scale vision and graph models.
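To make the Kronecker-factored idea concrete, the following is a minimal NumPy sketch of a standard K-FAC preconditioning step for a single dense layer. It is not taken from the paper: the function names `kfac_precondition` and `dense_reference`, the `damping` parameter, and the choice to damp each factor separately are illustrative assumptions. The sketch shows that two solves on the small factors give the same result as one solve on the full Kronecker-product curvature block; for dense $n \times n$ factors these solves are the cubic-cost step, which is the kind of matrix-inversion expense the summary refers to.

```python
import numpy as np


def kfac_precondition(grad, A, S, damping=1e-3):
    """Return (S + damping*I)^{-1} @ grad @ (A + damping*I)^{-1}.

    grad: (n_out, n_in) gradient of the loss w.r.t. the layer weights.
    A:    (n_in, n_in) second moment of the layer inputs.
    S:    (n_out, n_out) second moment of the back-propagated output gradients.
    damping is a hypothetical regularization constant added to each factor.
    """
    A_d = A + damping * np.eye(A.shape[0])
    S_d = S + damping * np.eye(S.shape[0])
    left = np.linalg.solve(S_d, grad)        # S_d^{-1} @ grad
    # A_d is symmetric, so solving against left.T and transposing
    # yields left @ A_d^{-1} without forming an explicit inverse.
    return np.linalg.solve(A_d, left.T).T


def dense_reference(grad, A, S, damping=1e-3):
    """Same step, computed by solving against the full Kronecker product."""
    n_out, n_in = grad.shape
    A_d = A + damping * np.eye(n_in)
    S_d = S + damping * np.eye(n_out)
    F = np.kron(A_d, S_d)                    # (n_in*n_out, n_in*n_out) block
    vec_g = grad.flatten(order="F")          # column-stacking vec(grad)
    return np.linalg.solve(F, vec_g).reshape((n_out, n_in), order="F")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_in, n_out = 32, 4, 3
    a = rng.normal(size=(batch, n_in))       # synthetic layer inputs
    g = rng.normal(size=(batch, n_out))      # synthetic output gradients
    A = a.T @ a / batch
    S = g.T @ g / batch
    grad = g.T @ a / batch
    step = kfac_precondition(grad, A, S)
    print(np.abs(step - dense_reference(grad, A, S)).max())
```

The factored version never materializes the $(n_\text{in} n_\text{out}) \times (n_\text{in} n_\text{out})$ matrix, which is what makes a per-layer treatment feasible at all; the remaining cost sits in the two factor solves.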
Abstract
Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers, such as thermodynamic computers, offer an efficient means to solve key primitives in AI training algorithms. Optimizers that normally would be computationally out of reach (e.g., due to expensive matrix inversions) on digital hardware could be unlocked with physics-based hardware. In this work, we propose a scalable algorithm for employing thermodynamic computers to accelerate a popular second-order optimizer called Kronecker-factored approximate curvature (K-FAC). Our asymptotic complexity analysis predicts increasing advantage with our algorithm as $n$, the number of neurons per layer, increases. Numerical experiments show that even under significant quantization noise, the benefits of second-order optimization can be preserved. Finally, we predict substantial speedups for large-scale vision and graph problems based on realistic hardware characteristics.
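As an illustration of how a physical system can solve a linear-algebra primitive, the sketch below digitally simulates overdamped Langevin dynamics in the quadratic potential $U(x) = \tfrac{1}{2} x^\top A x - b^\top x$. For symmetric positive-definite $A$, the stationary distribution is Gaussian with mean $A^{-1} b$, so a time average of the trajectory estimates the solution of $A x = b$. This is an assumption-laden toy, not the authors' algorithm or hardware model: the function name `langevin_solve`, the Euler-Maruyama discretization, and all parameter values (`beta`, `dt`, `n_steps`, `burn_in`) are hypothetical choices made for illustration.

```python
import numpy as np


def langevin_solve(A, b, beta=100.0, dt=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Estimate A^{-1} b by time-averaging simulated Langevin dynamics.

    Simulates dx = -(A x - b) dt + sqrt(2 / beta) dW with Euler-Maruyama.
    A must be symmetric positive definite, and dt small enough for stability.
    beta plays the role of an inverse temperature (larger means less noise).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b, dtype=float)
    noise_scale = np.sqrt(2.0 * dt / beta)
    running_sum = np.zeros_like(x)
    for step in range(n_steps):
        drift = -(A @ x - b)                 # negative gradient of U(x)
        x = x + dt * drift + noise_scale * rng.normal(size=x.shape)
        if step >= burn_in:                  # discard the initial transient
            running_sum += x
    return running_sum / (n_steps - burn_in)


if __name__ == "__main__":
    A = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.5, 0.3],
                  [0.0, 0.3, 1.0]])
    b = np.array([1.0, -1.0, 0.5])
    print("Langevin estimate:", langevin_solve(A, b))
    print("Direct solve:     ", np.linalg.solve(A, b))
```

On physical hardware the relaxation and averaging would be carried out by the device rather than by a step-by-step loop; the simulation only shows why the equilibrium mean encodes the solution, and the estimate carries sampling noise that shrinks with longer averaging or larger `beta`.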
