Can Training Dynamics of Scale-Invariant Neural Networks Be Explained by the Thermodynamics of an Ideal Gas?
Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail I. Katsnelson, Dmitry Vetrov
TL;DR
This work develops a thermodynamic framework that maps the dynamics of SGD with weight decay in scale-invariant neural networks onto quantities such as temperature, pressure, and volume. By deriving SDEs under three training protocols and analyzing an isotropic-noise model, the authors show that stationary SGD distributions follow an ideal-gas-like Gibbs form, with explicit correspondences between hyperparameters and thermodynamic variables. They validate the theory with isotropic toy-model experiments and neural-network experiments (e.g., ResNet-18 on CIFAR datasets), confirming predictions for stationary entropy, stationary radius, and Maxwell relations, and identifying the conditions under which the ideal-gas analogy holds. The results offer a principled interpretation of training dynamics and suggest thermodynamics-inspired strategies for hyperparameter tuning and learning rate scheduling, with potential extensions to non-ideal and non-scale-invariant settings.
Abstract
Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.
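To make the setting concrete, the sketch below simulates SGD with weight decay on a weight vector whose update direction is pure isotropic noise, and tracks the weight norm. It is a toy illustration, not the authors' model or code. It assumes only the two standard consequences of scale invariance: the gradient is orthogonal to the weights, and its norm scales as 1/r, where r is the weight norm. The names `simulate_radius`, `predicted_radius`, and `sigma` (the gradient norm on the unit sphere, held fixed here for simplicity) are hypothetical. Under these assumptions the squared norm obeys r² → (1 − ηλ)² r² + η² σ² / r², whose fixed point is r⁴ = η σ² / (λ (2 − ηλ)).

```python
import numpy as np


def simulate_radius(lr=0.1, wd=0.1, sigma=1.0, dim=50, steps=3000, seed=0):
    """Run noise-only SGD with weight decay and return the final weight norm.

    Assumption (illustrative): the "gradient" is a random direction in the
    tangent space of the sphere, rescaled to norm sigma / r, which mimics
    the 1/r gradient scaling of a scale-invariant loss.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w *= 2.0 / np.linalg.norm(w)  # start away from the stationary radius
    for _ in range(steps):
        r = np.linalg.norm(w)
        g = rng.normal(size=dim)
        g -= (g @ w) * w / r**2               # project out the radial component
        g *= sigma / (r * np.linalg.norm(g))  # enforce ||g|| = sigma / r
        w = (1.0 - lr * wd) * w - lr * g      # SGD step with weight decay
    return np.linalg.norm(w)


def predicted_radius(lr, wd, sigma):
    """Fixed point of r^2 -> (1 - lr*wd)^2 r^2 + lr^2 sigma^2 / r^2."""
    return (lr * sigma**2 / (wd * (2.0 - lr * wd))) ** 0.25


if __name__ == "__main__":
    print("simulated:", simulate_radius())
    print("predicted:", predicted_radius(0.1, 0.1, 1.0))
```

In this toy the weight norm settles where the shrinkage from weight decay balances the growth from orthogonal noise steps, so raising the learning rate or lowering the weight decay increases the stationary radius. Fixing the gradient norm makes the radius recursion deterministic; sampling it instead would add fluctuations around the same balance point.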
