SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training

Ildus Sadrtdinov, Ivan Klimov, Ekaterina Lobacheva, Dmitry Vetrov

TL;DR

This work develops a thermodynamic framework for neural network training by showing that SGD with fixed learning rates minimizes a Helmholtz free energy $F=U-TS$, where $U$ is the expected training loss and $S$ is the entropy of the weight distribution, with an effective temperature $T$ that depends on the learning rate (LR). The authors validate the framework empirically on underparameterized (UP) and overparameterized (OP) models, finding that UP yields a monotonically increasing $T(\text{LR})$ and a convex free-energy landscape, while OP exhibits a temperature collapse to zero at small LRs, enabling direct loss minimization. They attribute the UP/OP mismatch to differences in the signal-to-noise ratio of stochastic gradients near optima, supported by a 3D-sphere toy model and neural network experiments. The results offer a novel lens to interpret training dynamics, connecting fixed-LR behavior, implicit regularization, and phase-transition-like phenomena in deep learning.

Abstract

We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function $F=U-TS$, balancing training loss $U$ and the entropy of the weights distribution $S$, with temperature $T$ determined by the LR. This perspective offers a new lens on why high LRs prevent training from converging to the loss minima and how different LRs lead to stabilization at different loss levels. We empirically validate the free energy framework on both underparameterized (UP) and overparameterized (OP) models. UP models consistently follow free energy minimization, with temperature increasing monotonically with LR, while for OP models, the temperature effectively drops to zero at low LRs, causing SGD to minimize the loss directly and converge to an optimum. We attribute this mismatch to differences in the signal-to-noise ratio of stochastic gradients near optima, supported by both a toy example and neural network experiments.
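
To make the trade-off in $F=U-TS$ concrete, here is a minimal sketch on a stand-in problem that is not the paper's setup: SGD on a 1D quadratic loss $\tfrac{1}{2}hw^2$ with additive Gaussian gradient noise of standard deviation $\sigma$. Under these assumptions the stationary weight distribution is Gaussian, so $U$ and $S$ have closed forms, and the temperature at which a given LR minimizes $F$ is $T = dU/dS$. The function names and the parameters `h` and `sigma` are illustrative choices, not taken from the paper.

```python
import math

def stationary_stats(lr, h=1.0, sigma=1.0):
    """Stationary variance, expected loss U and entropy S of fixed-LR SGD.

    Assumed toy model (not the paper's): w <- w - lr * (h * w + eps),
    eps ~ N(0, sigma^2). The stationary law is Gaussian with the variance below.
    """
    if not 0.0 < lr * h < 2.0:
        raise ValueError("need 0 < lr * h < 2 for a stationary distribution")
    var = lr * sigma**2 / (h * (2.0 - lr * h))
    U = 0.5 * h * var                                  # expected loss
    S = 0.5 * math.log(2.0 * math.pi * math.e * var)   # Gaussian differential entropy
    return var, U, S

def free_energy(lr, T, h=1.0, sigma=1.0):
    """F = U - T * S evaluated at the stationary distribution for this LR."""
    _, U, S = stationary_stats(lr, h, sigma)
    return U - T * S

def temperature(lr, h=1.0, sigma=1.0):
    """Temperature at which this LR minimizes F, i.e. T = dU/dS = h * var."""
    var, _, _ = stationary_stats(lr, h, sigma)
    return h * var

if __name__ == "__main__":
    for lr in (0.01, 0.05, 0.1, 0.2):
        _, U, S = stationary_stats(lr)
        print(f"lr={lr:<5} U={U:.5f} S={S:.5f} T={temperature(lr):.5f}")
```

In this toy, a larger LR gives a higher stationary loss, a higher entropy, and a higher temperature, which mirrors the monotone behavior the abstract reports for UP models. It does not reproduce the OP temperature collapse, since the gradient noise here does not vanish at the optimum.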

Paper Structure

This paper contains 26 sections, 2 theorems, 32 equations, 17 figures, 1 table.

Key Result

Lemma 1

Consider a unit vector $r\in \mathbb{R}^D$, $\|r\|=1$. Then, for the weights lying in the direction of $r$ from the optimum $w^*$, i.e., $w=w^* + \delta r$ for some small $\delta > 0$, we have:

Figures (17)

  • Figure 1: Stationary loss and entropy for UP and OP settings. ConvNet on CIFAR-10.
  • Figure 2: Loss and entropy (left), estimated temperature (center) and free energy for different temperatures (right) vs. LR for UP (top row) and OP (bottom row) ConvNet on CIFAR-10. Orange and blue dashed lines denote the entropy and loss estimates of a uniform distribution on the sphere surface. Black dashed lines show power-law approximation to the temperature. Black stars indicate minima of free energy, which are achieved at LRs, corresponding to temperature values. Standard deviation for loss and entropy is computed over several iterations of the same training run, while translucent filling in the temperature plot indicates confidence intervals.
  • Figure 3: Temperature decay for small LRs of the OP ConvNet on CIFAR-10.
  • Figure 4: Loss surfaces of UP and OP setups and training trajectories of SGD for 3D toy example. Red lines correspond to a low LR ($2.4 \cdot 10^{-3}$), purple lines to a higher LR ($6.9\cdot 10^{-2}$). Black lines indicate great circles.
  • Figure 5: Training metrics for various LRs in UP and OP setups on a 3D sphere. To plot the full loss, we subtract the loss value at the minimum (positive for UP and zero for OP).
  • ...and 12 more figures

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof