Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation
Markus Gross, Arne P. Raulf, Christoph Räth
TL;DR
This work analyzes weight fluctuations in underparameterized linear neural networks trained by SGD, using a continuum Langevin framework to understand how stochasticity shapes late-time training. It first shows that in a single-layer network the spectrum of the SGD noise covariance deviates from that of the Hessian because SGD dynamics break detailed balance, producing anisotropic weight fluctuations even though the loss they effectively experience is isotropic. It then extends the analysis to a two-layer linear network, deriving a layerwise stochastic gradient flow and performing a perturbative linearization around a quasi-stationary state; the inter-layer coupling and rank deficiencies of the drift-diffusion operators give rise to anisotropic fluctuations and enable an analytic derivation of the inverse variance-flatness relation (IVFR) for fluctuations in the second layer. The results connect microscopic SGD noise structure to macroscopic loss landscape geometry, offering theoretical insight into why broader, flatter valleys correlate with stable training and generalization, and suggesting a principled route toward Bayesian sampling approaches in neural networks.
Abstract
We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly underparameterized regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but effectively experience an isotropic loss. For an underparameterized two-layer network, we describe the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a distinct source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations are effectively subject to an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
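As a rough numerical illustration of the single-layer statement, the sketch below compares the Hessian of a mean-squared-error loss with the covariance of per-sample gradients (the SGD noise covariance for batch size 1) at the loss minimum. It is a minimal sketch under assumptions that are not taken from the paper: a noisy linear teacher generates the labels, the batch size is 1, and the function name `noise_and_hessian` and all parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def noise_and_hessian(n_samples=60, dim=40, label_noise=0.5, seed=0):
    """Hessian and per-sample gradient covariance at the least-squares minimum.

    Assumed setup (not from the paper): loss L(w) = (1/2N) * sum_i (w.x_i - y_i)^2,
    Gaussian inputs, labels from a noisy linear teacher, n_samples > dim.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, dim))        # synthetic Gaussian inputs
    w_teacher = rng.standard_normal(dim)             # hypothetical teacher weights
    y = X @ w_teacher + label_noise * rng.standard_normal(n_samples)

    # Minimum of the training loss (unique because n_samples > dim).
    w_min, *_ = np.linalg.lstsq(X, y, rcond=None)

    residuals = X @ w_min - y                        # nonzero when underparameterized
    per_sample_grads = residuals[:, None] * X        # gradient of each sample's loss
    mean_grad = per_sample_grads.mean(axis=0)        # full-batch gradient, ~0 at the minimum

    hessian = X.T @ X / n_samples
    centered = per_sample_grads - mean_grad
    noise_cov = centered.T @ centered / n_samples    # batch-size-1 SGD noise covariance
    return hessian, noise_cov, mean_grad

hessian, noise_cov, mean_grad = noise_and_hessian()
h_eigs = np.linalg.eigvalsh(hessian)
c_eigs = np.linalg.eigvalsh(noise_cov)
print("Hessian eigenvalue range:   ", h_eigs.min(), h_eigs.max())
print("Noise-cov eigenvalue range: ", c_eigs.min(), c_eigs.max())
```

Moving `n_samples` closer to `dim` (the weakly underparameterized regime in this toy setup) and comparing the two printed spectra gives a quick feel for how far the noise covariance can sit from the Hessian; the sketch does not reproduce the paper's Langevin analysis or its stationary weight covariances.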
