Fluctuation-dissipation relations for stochastic gradient descent
Sho Yaida
TL;DR
Problem: relate minibatch noise in SGD to parameter dynamics during stationary training. Approach: derive exact, stationarity-based fluctuation-dissipation relations using a discrete-time master-equation framework that accommodates non-Gaussian noise and nonconvex landscapes. Key findings: FDR1 provides a practical equilibration metric and adaptive learning-rate schedule; FDR2 enables probing the loss landscape via the Hessian and anharmonicity. Empirical validation on MNIST and CIFAR-10 confirms the relations and demonstrates the practical utility of adaptive scheduling.
Abstract
The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as a generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set the training schedule. We can further use the relations to efficiently extract information about the loss-function landscape, such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.
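To make the idea of a stationarity-based relation concrete, here is a minimal numerical sketch. It assumes plain SGD without momentum, theta_{t+1} = theta_t - eta * g_t, where g_t is the minibatch gradient. Requiring that the average of theta . theta be unchanged by one update gives <theta . g> = (eta / 2) <g . g> in any stationary state. The formula is not written out in this abstract, so treat it as an illustration of the kind of relation meant; the toy least-squares problem, the function name `sgd_fdr1_check`, and all parameter values are assumptions made for this example.

```python
import numpy as np

def sgd_fdr1_check(eta=0.05, batch=8, n_data=512, dim=4,
                   burn_in=2_000, steps=50_000, seed=0):
    """Run plain SGD on a toy least-squares loss and time-average both
    sides of <theta . g> = (eta / 2) <g . g> after a burn-in period."""
    rng = np.random.default_rng(seed)
    # Toy data (assumption): noisy linear targets, so minibatch gradients
    # keep fluctuating even near the minimum.
    X = rng.normal(size=(n_data, dim))
    y = X @ rng.normal(size=dim) + rng.normal(size=n_data)

    theta = np.zeros(dim)
    lhs = 0.0  # accumulates theta . g
    rhs = 0.0  # accumulates (eta / 2) * g . g
    for t in range(burn_in + steps):
        idx = rng.integers(0, n_data, size=batch)  # sample with replacement
        resid = X[idx] @ theta - y[idx]
        grad = X[idx].T @ resid / batch            # minibatch gradient
        if t >= burn_in:
            lhs += theta @ grad
            rhs += 0.5 * eta * (grad @ grad)
        theta = theta - eta * grad                 # plain SGD update
    return lhs / steps, rhs / steps

if __name__ == "__main__":
    lhs, rhs = sgd_fdr1_check()
    print(f"<theta . g>        = {lhs:.6f}")
    print(f"(eta/2) <g . g>    = {rhs:.6f}")
```

In this sketch the gap between the two time averages equals (|theta_start|^2 - |theta_end|^2) / (2 * eta * steps), so it shrinks as the averaging window grows. A persistent gap would indicate that the run has not yet reached stationarity, which is the sense in which such a relation can serve as an equilibration check.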
