Thermodynamic Natural Gradient Descent

Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles

TL;DR

Thermodynamic Natural Gradient Descent (TNGD) is a hybrid digital-analog optimizer that uses a stochastic processing unit to solve for the natural gradient $\tilde{g}_k \approx F_{k-1}^{-1} \nabla \ell_{k-1}$ via Ornstein–Uhlenbeck dynamics, yielding updates $\theta_{k+1} = \theta_k - \eta \tilde{g}_k$ with per-iteration costs approaching those of first-order methods. By offloading curvature computations to an analog thermodynamic computer, TNGD achieves near-linear scaling in the number of parameters and can interpolate between SGD and NGD through the analog-runtime parameter $t$, while maintaining convergence in mean for positive-definite $F$. The authors demonstrate competitive performance on MNIST and language-model fine-tuning tasks, notably showing speedups over state-of-the-art digital optimizers and robustness to hardware noise; a hybrid variant (TNGD-Adam) can outperform Adam on QA tasks. Overall, this work illustrates the potential of co-designing optimization algorithms with specialized hardware to realize the benefits of second-order methods at practical scales.
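The interpolation between SGD and NGD can be illustrated with a small digital sketch. It assumes the standard Ornstein–Uhlenbeck form $dx = -(Ax - b)\,dt + \text{noise}$, whose mean obeys $\mathbb{E}[x(t)] = A^{-1}b + e^{-At}\,(x_0 - A^{-1}b)$ for symmetric positive-definite $A$. The function name `ou_mean`, the toy matrix `F`, and the choice of starting the dynamics at the gradient ($x_0 = g$) are assumptions made for illustration, not details taken from this summary.

```python
import numpy as np

def ou_mean(A, b, x0, t):
    """Mean of dx = -(A x - b) dt + noise at time t (A symmetric positive-definite)."""
    evals, evecs = np.linalg.eigh(A)
    x_star = evecs @ ((evecs.T @ b) / evals)      # equilibrium mean A^{-1} b
    decay = np.exp(-evals * t)                    # per-eigenmode relaxation factor
    return x_star + evecs @ (decay * (evecs.T @ (x0 - x_star)))

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
F = M @ M.T + 0.5 * np.eye(5)   # toy positive-definite curvature matrix (hypothetical)
g = rng.normal(size=5)          # toy gradient (hypothetical)

# Assumed initial condition: the dynamics start at the plain gradient.
sgd_like = ou_mean(F, g, x0=g, t=0.0)    # no analog runtime: the gradient itself
ngd_like = ou_mean(F, g, x0=g, t=50.0)   # long analog runtime: close to F^{-1} g
```

Under these assumptions, $t = 0$ returns the raw gradient (an SGD-like direction), a long runtime returns approximately $F^{-1} g$ (the NGD direction), and intermediate values of $t$ lie between the two.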

Abstract

Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.
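As a rough illustration of how an equilibrium property can replace a linear solve, the sketch below simulates an Ornstein–Uhlenbeck process digitally with Euler–Maruyama steps and time-averages the samples. It assumes the dynamics $dx = -(Ax - b)\,dt + \sqrt{2/\beta}\,dW$, whose stationary mean is $A^{-1}b$. The function name `ou_time_average`, the inverse temperature `beta`, the step size, and the toy `A` and `b` are hypothetical choices; this is a software stand-in, not a model of the analog hardware.

```python
import numpy as np

def ou_time_average(A, b, beta=1000.0, dt=0.01, n_steps=20000, burn_in=2000, seed=0):
    """Time-average Euler-Maruyama samples of dx = -(A x - b) dt + sqrt(2/beta) dW."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b)
    running_sum = np.zeros_like(b)
    noise_scale = np.sqrt(2.0 * dt / beta)
    for step in range(n_steps):
        # Drift pulls x toward A^{-1} b; the noise term models thermal fluctuations.
        x = x - dt * (A @ x - b) + noise_scale * rng.normal(size=b.shape)
        if step >= burn_in:          # discard the transient before averaging
            running_sum += x
    return running_sum / (n_steps - burn_in)

# Toy symmetric positive-definite system (diagonally dominant, so eigenvalues > 0).
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])
b = np.array([1.0, -2.0, 0.5])

estimate = ou_time_average(A, b)
exact = np.linalg.solve(A, b)
```

The time average approximates `exact` only up to statistical error, which shrinks with longer averaging or lower noise. On a digital machine this loop is slower than a direct solve; the sketch only shows the principle that the paper proposes to run in analog hardware.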

Paper Structure (20 sections, 24 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Overview of Thermodynamic Natural Gradient Descent (TNGD). A GPU that stores the model architecture and provides the gradient $\nabla \ell _k$ and Fisher matrix $F_k$ (through its representation given by the Jacobian $J_f$ and Hessian $H_L$ matrices given by Eq. \ref{eq:ggn}) at step $k$ is connected to a thermodynamic computer, called the stochastic processing unit (SPU). At times $t_{k}$, the estimate of the natural gradient $\tilde{g}_{k}$ is sent to the GPU, which updates the parameters of the model and calculates gradients and curvature matrices for some new data batch $(x_{k}, y_{k})$. During digital auto-differentiation, the SPU undergoes dynamical evolution, either continuing to approach its steady-state or remaining in it. After some time, gradient $\nabla \ell _{k}$ and Fisher matrix $F_{k}$ are sent to the SPU through a DAC and digital controllers. This modifies the dynamics of the SPU, and after some time interval, a new natural gradient estimate $\tilde{g}_{k+1}$ is sent back to the GPU. Note that the time between two measurements $t_{k+1} - t_{k}$ need not be greater than the time between two auto-differentiation calls. The hybrid digital-thermodynamic process may be used asynchronously as shown in the diagram (where the time of measurement of $\tilde{g}$ and upload of the gradient and Fisher matrix are not the same).
  • Figure 2: Runtime per iteration of second-order optimizers considered in this paper. (a) The runtimes per iteration are compared for NGD, NGD-CG, NGD-Woodbury, and TNGD (estimated) for various $N$. Here the convolutional network we applied to MNIST is used and the dimension of the hidden layer is varied to vary $N$ for fixed $d_z = 20$. (b) The same comparison is shown for various values of $d_z$. The same network is used and $d_z$ is varied (this also has the effect of varying $N$). Error bars are displayed as shaded areas but are smaller than the data markers.
  • Figure 3: Performance comparison of Adam and TNGD (simulated) on MNIST classification. (a) Training (dashed lines) and test loss (solid lines) for Adam (darker colors) and TNGD (lighter colors) are plotted against runtime (measured for Adam, and estimated for TNGD from the timing model described in Section \ref{section:cc_and_perf}). Shaded areas are standard deviations over five random seeds. Note that Adam includes adaptive averaging of first and second moment estimates with $(\beta_1, \beta_2) = (0.9, 0.999)$, while TNGD does not. (b) $1 - \mathrm{Accuracy}$ for training and test sets.
  • Figure 4: Training loss vs. iterations for varying analog dynamics times. (a) The training loss is shown for NGD (dashed line) and for TNGD with various analog dynamics times $t$ (solid lines). (b) The training loss is shown for NGD (dashed line) and for TNGD with fixed analog dynamics time $t = 5\tau$ and varying delay times $t_d$ (solid lines). The delay appears to have a momentum effect, which can even lead to TNGD outperforming exact NGD for certain analog dynamics and delay times. Shaded areas are standard deviations over five random seeds.
  • Figure 5: Training loss vs. iterations for QA fine-tuning. (a) Comparison of the performance per iteration of TNGD, Adam, and TNGD-Adam, where the latter uses the natural gradient estimate in conjunction with the Adam update rule with $(\beta_1, \beta_2) = (0,0)$. (b) Performance of the TNGD-Adam optimizer for various analog dynamics times. Similar to Fig. \ref{fig:mnist_analog}, the performance improves as $t$ grows.
  • ...and 2 more figures
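
The hybrid loop described in the Figure 1 caption can be sketched digitally on a toy problem. In the sketch below, the "GPU" side computes a gradient and a curvature matrix, and a noise-free stand-in for the SPU relaxes its persistent state toward the damped natural gradient for a fixed analog time before the parameters are updated. Everything concrete here is an assumption for illustration: the linear least-squares model, the use of $X^\top X / n$ as the curvature matrix (the $J^\top H_L J$ form with $H_L = I$), the helper `spu_evolve`, and the values of `eta`, `damping`, and `t_analog`.

```python
import numpy as np

def spu_evolve(x, A, b, t):
    """Noise-free SPU stand-in: mean of dx = -(A x - b) dt after time t, started at x."""
    evals, evecs = np.linalg.eigh(A)
    x_star = evecs @ ((evecs.T @ b) / evals)      # steady state A^{-1} b
    return x_star + evecs @ (np.exp(-evals * t) * (evecs.T @ (x - x_star)))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                      # toy inputs (hypothetical)
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta = np.zeros(4)
x_spu = np.zeros(4)       # SPU state persists between iterations (warm start)
eta, damping, t_analog = 0.5, 1e-2, 5.0
initial_loss = loss(theta)

for k in range(40):
    # "GPU" side: gradient and curvature for the current batch (full batch here).
    grad = X.T @ (X @ theta - y) / len(y)
    F = X.T @ X / len(y)
    # "SPU" side: relax toward (F + damping * I)^{-1} grad for analog time t_analog.
    x_spu = spu_evolve(x_spu, F + damping * np.eye(4), grad, t_analog)
    # Parameter update with the natural gradient estimate read from the SPU.
    theta = theta - eta * x_spu

final_loss = loss(theta)
```

Because the SPU state is carried over rather than reset, each estimate retains some memory of the previous one. That is one simple way to picture the momentum-like effect mentioned in the Figure 4 caption, though this sketch does not model the delay time $t_d$ or hardware noise.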