Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas?

Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov

TL;DR

This work develops a thermodynamic framework linking SGD dynamics for scale-invariant neural networks to thermodynamic concepts such as temperature, pressure, and volume. By deriving SDEs under three training protocols and analyzing an isotropic-noise model, the authors show that stationary SGD distributions follow an ideal-gas-like Gibbs form, with rigorous mappings between hyperparameters and thermodynamic variables. They validate the theory through isotropic toy-model experiments and neural-network trials (e.g., ResNet-18 on CIFAR datasets), confirming predictions for stationary entropy, stationary radius, and Maxwell relations, while highlighting conditions under which the ideal-gas analogy holds. The results offer a principled interpretation of training dynamics and suggest thermodynamics-inspired strategies for hyperparameter tuning and learning rate scheduling, with potential extensions to non-ideal and non-scale-invariant settings.

Abstract

Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

Paper Structure

This paper contains 65 sections, 96 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Results for the VMF isotropic noise model with fixed LR $\eta$ and WD $\lambda$. Subfigures a--d: points are numerical measurements, solid lines are theoretical predictions: $U=\frac{d-1}{2}T$, $S(\rho_{\overline{\bm{w}}})=\frac{d-1}{2}\log(2\pi eT)$, $S = S(\rho_{\overline{\bm{w}}}) + (d-1)\log r^*$, and $r^*=\sqrt{\frac{T(d-1)}{p}}$, with $T=\sqrt{\frac{\eta\lambda\sigma^2}{2(d-1)}}$ and $p=\lambda$. Subfigure e: Gibbs energy minimization. Each subplot corresponds to a fixed pair $(\eta^*, \lambda^*)$, marked with a red circle. The colormap shows the difference between $G$ and its minimum across stationary distributions, with the minimizer marked by a white square. Ideally, red circles coincide with white squares; in practice, they either match or lie very close.
  • Figure 2: Results for ResNet-18 on CIFAR-10 with fixed LR $\eta$ and WD $\lambda$. Subfigures a, b, d: empirically measured $\sigma^2$, mean loss $L$, and temperature $T$ given by $T=\sqrt{\frac{\eta\lambda\sigma^2}{2(d-1)}}$, respectively. Subfigure c: stationary radius $r^*=\sqrt{\frac{T(d-1)}{p}}$ (solid lines, theory) vs. experimental values (points). Subfigures e and f: entropy $S$ as a function of $\eta$ and $\lambda$; solid lines with markers show experimental estimates, dashed lines their smoothed versions.
  • Figure 3: Results for the VMF isotropic noise model on a fixed sphere with radius $r$ and ELR $\eta_{\text{eff}}$. Subfigures a--d: points are numerical measurements, solid lines are theoretical predictions: $U=\frac{d-1}{2}T$, $S(\rho_{\overline{\bm{w}}})=\frac{d-1}{2}\log(2\pi eT)$, $S = S(\rho_{\overline{\bm{w}}}) + (d-1)\log r$, and $\lambda_{\text{eff}}=\frac{T(d-1)}{2V}$, with $T=\frac{\eta_{\text{eff}}\sigma^2}{2}$ and $V=\frac{r^2}{2}$. Subfigure e: Helmholtz energy minimization. Each subplot corresponds to a radius value $r$. The horizontal axis varies the $\eta_{\text{eff}}$ that sets the temperature $T^*$ in the Helmholtz energy $F$; the vertical axis indexes the stationary distributions induced by different $\eta_{\text{eff}}$. The colormap shows the difference between $F$ and its minimum across different stationary distributions (i.e., across each column), with the minimizer marked by a white square. Ideally, white squares coincide with the diagonal; in practice, they either match or lie very close.
  • Figure 4: Results for the VMF isotropic noise model with fixed ELR $\eta_{\text{eff}}$ and WD $\lambda$. Subfigures a--d: points are numerical measurements, solid lines are theoretical predictions: $U=\frac{d-1}{2}T$, $S(\rho_{\overline{\bm{w}}})=\frac{d-1}{2}\log(2\pi eT)$, $S = S(\rho_{\overline{\bm{w}}}) + (d-1)\log r^*$, and $r^*=\sqrt{\frac{T(d-1)}{p}}$, with $T=\frac{\eta_{\text{eff}}\sigma^2}{2}$ and $p=\lambda$. Subfigure e: Gibbs energy minimization. Each subplot corresponds to a fixed pair $(\eta_{\text{eff}}^*, \lambda^*)$, marked with a red circle. The colormap shows the difference between $G$ and its minimum across stationary distributions, with the minimizer marked by a white square. Ideally, red circles coincide with white squares; in practice, they either match or lie very close.
  • Figure 5: Results for ResNet-18 on CIFAR-10 and CIFAR-100 with fixed LR $\eta$ and WD $\lambda$. Subfigures a, b, d: empirically measured $\sigma^2$, mean loss $L$, and temperature $T$ given by $T=\sqrt{\frac{\eta\lambda\sigma^2}{2(d-1)}}$, respectively. Subfigure c: stationary radius $r^*=\sqrt{\frac{T(d-1)}{p}}$ (solid lines, theory) vs. experimental values (points). Subfigures e and f: entropy $S$ as a function of $\eta$ and $\lambda$; solid lines with markers show experimental estimates, dashed lines their smoothed versions.
  • ...and 8 more figures
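
The closed-form predictions quoted in the captions above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names and the example values of $\eta$, $\lambda$, $\sigma^2$, and $d$ are hypothetical, and it assumes that the volume definition $V=\frac{r^2}{2}$ from the fixed-sphere caption also applies at the stationary radius $r^*$. Under that assumption, the caption formulas combine into the ideal-gas-like relation $pV=\frac{d-1}{2}T=U$.

```python
import math


def temperature_fixed_lr(eta, lam, sigma2, d):
    """T = sqrt(eta * lam * sigma^2 / (2 (d - 1))) for fixed LR and WD."""
    return math.sqrt(eta * lam * sigma2 / (2 * (d - 1)))


def temperature_fixed_elr(eta_eff, sigma2):
    """T = eta_eff * sigma^2 / 2 for the fixed-ELR protocols."""
    return eta_eff * sigma2 / 2


def stationary_radius(T, p, d):
    """r* = sqrt(T (d - 1) / p), with pressure p = lambda."""
    return math.sqrt(T * (d - 1) / p)


def internal_energy(T, d):
    """U = (d - 1) / 2 * T."""
    return (d - 1) / 2 * T


def direction_entropy(T, d):
    """S(rho_w) = (d - 1) / 2 * log(2 pi e T)."""
    return (d - 1) / 2 * math.log(2 * math.pi * math.e * T)


def total_entropy(T, r, d):
    """S = S(rho_w) + (d - 1) log r."""
    return direction_entropy(T, d) + (d - 1) * math.log(r)


def effective_weight_decay(T, r, d):
    """lambda_eff = T (d - 1) / (2 V), with V = r^2 / 2 (fixed sphere)."""
    volume = r ** 2 / 2
    return T * (d - 1) / (2 * volume)


# Hypothetical values chosen only for illustration.
eta, lam, sigma2, d = 0.01, 5e-4, 1.0, 1000

T = temperature_fixed_lr(eta, lam, sigma2, d)
r_star = stationary_radius(T, lam, d)
V = r_star ** 2 / 2  # assumption: V = r^2 / 2 evaluated at r*

print("T =", T)
print("r* =", r_star)
print("pV =", lam * V, " U =", internal_energy(T, d))
print("S =", total_entropy(T, r_star, d))
```

With these definitions, `effective_weight_decay(T, r_star, d)` recovers `lam`, which is the sense in which the fixed-sphere formula for $\lambda_{\text{eff}}$ and the stationary-radius formula with $p=\lambda$ express the same relation.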