Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation

Markus Gross, Arne P. Raulf, Christoph Räth

TL;DR

This work analyzes weight fluctuations in underparameterized linear neural networks trained by SGD, using a continuum Langevin framework to understand how stochasticity shapes late-time training. It first shows that in a single-layer net the SGD noise spectrum deviates from the Hessian due to broken detailed balance, producing anisotropic weight fluctuations while the loss remains effectively isotropic. It then extends to a two-layer linear network, deriving a layerwise stochastic gradient flow and performing a perturbative linearization around a quasi-stationary state; the inter-layer coupling and rank deficiencies of the drift-diffusion operators give rise to anisotropic fluctuations and enable an analytic derivation of the inverse variance-flatness relation (IVFR) for fluctuations in the second layer. The results connect microscopic SGD noise structure to macroscopic loss landscape geometry, offering theoretical insight into why broader, flatter valleys correlate with stable training and generalization, and suggesting a principled route toward Bayesian sampling approaches in neural networks.

Abstract

We investigate the stationary (late-time) training regime of single- and two-layer underparameterized linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly underparameterized regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but effectively experience an isotropic loss. For an underparameterized two-layer network, we describe the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a distinct source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations are effectively subject to an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a model of a deep linear neural network.
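
The single-layer setting described above can be explored numerically with a minimal sketch like the following. It is not the authors' code. It assumes a mean-squared-error loss with a factor 1/2, inputs with variance $1/N$ (as quoted in the caption of Figure 3), mini-batches drawn without replacement, and illustrative values for `N`, `P`, `lr`, `S`, and `sigma_eps`. The function name `run_sgd` and all variable names are hypothetical.

```python
import numpy as np

def run_sgd(N=20, P=80, lr=0.1, S=10, sigma_eps=0.5,
            steps=40000, burn_in=10000, seed=0):
    """Mini-batch SGD for a linear student trained on teacher-generated labels.

    Returns the least-squares solution w_star, the time-averaged weights in the
    late-time regime, and the empirical covariance M = <dw dw^T>, dw = w - w_star.
    """
    rng = np.random.default_rng(seed)
    # Synthetic Gaussian inputs, one sample per row (assumed layout), variance 1/N.
    X = rng.normal(0.0, np.sqrt(1.0 / N), size=(P, N))
    u = rng.normal(size=N)                                  # fixed random teacher weights
    y = X @ u + rng.normal(0.0, sigma_eps, size=P)          # noisy labels
    # Underparameterized case (P > N): unique minimizer of the training loss.
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]

    w = np.zeros(N)
    samples = []
    for t in range(steps):
        idx = rng.choice(P, size=S, replace=False)          # draw a mini-batch
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / S                      # gradient of (1/2S) * sum(residual**2)
        w -= lr * grad
        if t >= burn_in:                                    # keep late-time samples only
            samples.append(w.copy())

    samples = np.asarray(samples)
    dw = samples - w_star
    M = dw.T @ dw / len(dw)                                 # empirical weight covariance
    return w_star, samples.mean(axis=0), M

if __name__ == "__main__":
    w_star, w_mean, M = run_sgd()
    eigs = np.linalg.eigvalsh(M)
    # Eigenvalues in units of lr * sigma_eps^2 / (2 S), the normalization used in Figure 5.
    print(eigs / (0.1 * 0.5**2 / (2 * 10)))
```

Unequal eigenvalues of `M` indicate anisotropic weight fluctuations. Comparing the spectrum across several ratios `P / N` is one way to probe the dependence on the degree of underparameterization.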

Paper Structure (31 sections, 91 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Single-layer linear neural network model. The student network (left) transforms the input $\mathbf{x}$ to the output $\hat{y}$ using its learned weights $\mathbf{w}$, while the teacher network (right) generates the targets ("labels") $y$ using fixed random weights $\mathbf{u}$ and noise $\epsilon$ (see text).
  • Figure 2: Deviations from the Hessian approximation of the noise covariance of SGD of a linear network [\ref{eq_LR_noise_cov}]. (a) Variance of the matrix elements of $\mathcal{P}=X^+X$, which enters in \ref{eq_noise_correct}, as a function of the sample number $P$ rescaled by the input dimension $N$. Curves for various $N$ and $P$ superimpose when expressed in appropriately scaled coordinates. (b) Entries, ordered by magnitude, of the diagonal random matrix $K$ [\ref{eq_noise_correct}], which determines the covariance of the SGD noise in \ref{eq_LR_noise_cov}. The various solid curves correspond to sample numbers ranging between $P=70$ and $P=10000$ (from bottom to top) with fixed input dimension $N=50$, averaged over 100 realizations of $K$. The dotted black curve represents the approximation $K^{\mu\mu} = (\boldsymbol\epsilon^\mu)^2$, where $\boldsymbol\epsilon^\mu$ is a vector of independent Gaussian random numbers of zero mean and variance $\sigma_\epsilon^2$. The scaling of $\mathrm{var}(\mathcal{P})\propto N^{-1}s^{-1.8}$ observed in (a) ensures that the elements of the diagonal matrix $K$, which characterizes the deviation of the noise from the Hessian [see \ref{eq_LR_noise_cov}], are Gaussian i.i.d. for $s\to\infty$.
  • Figure 3: Gradient noise of SGD for a linear network [see \ref{eq_LR_noise_cov_gen}]: anisotropy, dependence on sample number, and deviations from the Hessian approximation. (a) Eigenvalues of the gradient noise covariance matrix $C$, ordered by their magnitude, for varying number of samples $P$ (increasing from bottom to top) in the underparameterized regime and at input dimension $N=100$. Numerical results obtained from SGD (single data realization with learning rate $\lambda=0.1$, mini-batch size $S=10$, label noise $\sigma_\epsilon^2=10^{-2}$; solid lines) are compared to the theoretical predictions obtained from \ref{eq_LR_noise_cov} (averaged over multiple data realizations). (b) Validity of the Hessian approximation of the noise for $P/N\gg 1$: Eigenvalues of $C$ given by \ref{eq_LR_noise_cov} (solid lines) compared to those of \ref{eq_noise_cov_Hess} (broken lines, essentially corresponding to the eigenvalues of the Hessian $H$), for $P/N=1.1,2,50$ (from bottom to top right) at $N=100$. The inset shows the maximum and minimum eigenvalues of $C$ from \ref{eq_LR_noise_cov} (solid connecting lines) and \ref{eq_noise_cov_Hess} (dashed connecting lines) for varying $P$. Eigenvalues in (a,b) are normalized by the factor $\sigma_x^2 \sigma_\epsilon^2/S$, where $\sigma_x^2=1/N$ is the variance of the samples [see also \ref{eq_noise_cov_Hess}]. (c) Visualization of the noise covariance (for $N/P\simeq 0.67$) in the basis of the Hessian $H$ [\ref{eq_cov_mats}], i.e., $\tilde{C} = \langle |V^T C V|\rangle_{x,\epsilon}$, where the average is over several data realizations and the matrix $V$ consists of the eigenvectors of $H$ as columns. The color range is scaled logarithmically and normalized to the maximum entry of $\tilde{C}$. The magnitude of the off-diagonal entries decreases upon increasing $P$.
  • Figure 4: Exemplary relaxation behavior of the loss $L(\mathbf{z})$ (dash-dotted curve) and the weight modes $z_k(t)=V^T_{kj} w_j$ [for $k=1,5,25$, see \ref{eq_SGD_Langevin_transf}] towards the solution $z_k^*$ (single run). Modes obtained from SGD experiments (solid lines) are compared to theoretical predictions $\langle z_k(t)\rangle$ (\ref{eq_LR_modes_dyn}, averaged over the noise; dashed), for a slightly underparameterized linear net with $P/N\simeq 1.04$. The inset shows the loss and the mode $|z_0(t)|$ in double-logarithmic representation. The gray line in the inset represents the theoretical prediction $L^*$ for the final loss. Parameters used for SGD experiments: learning rate $\lambda=0.1$, label noise $\sigma_\epsilon^2=10^{-2}$, input dimension $N=50$, sample number $P=52$, batch size $S=10$.
  • Figure 5: Statistics of weight fluctuations for a linear network in the stationary state. (a) Behavior of the weight variance as a function of input dimension for $P=200$ samples. Simulation results (connected symbols) obtained for a single run of SGD with learning rate $\lambda=0.1$, label noise $\sigma_\epsilon^2=0.25$, and mini-batch size $S=10$ are compared to the theoretical predictions of \ref{eq_wts_cov_conv} and \ref{eq_wts_cov} for $\langle\langle\delta w_i(t)^2\rangle_t\rangle_i = N^{-1}\sum_i M_{ii}$ (dotted and dash-dotted lines), which represents the temporal variance of the weights in the stationary state [where $\mathbf{w}(t\to\infty)=\mathbf{w}^*$, see \ref{eq_LR_sol}] averaged over all weights. The quantity $\langle\delta w_i(t\to\infty)^2\rangle_i$ (filled circles and dashed line) represents the variance across all weights at an arbitrary but fixed time in the stationary state. The train and test losses are also shown for comparison. (b) Eigenvalues of the covariance matrix of the weights $M=\langle\delta\mathbf{w} \delta\mathbf{w}^T\rangle$, ordered by their magnitude and normalized by $\lambda\sigma_\epsilon^2/(2S)$ [see \ref{eq_wts_cov_conv}], for varying number of samples ($P/N = 1.05, 1.1, 1.2, 1.5, 2.0, 4.0, 10$, from bottom to top) in the underparameterized regime (for $\lambda=0.1$, $S=10$, and input dimension $N=100$). Numerical results obtained from SGD (solid lines) are compared to the theoretical predictions obtained from \ref{eq_wts_cov} (dashed lines; averaged over multiple data realizations). The straight (dash-dotted) black line represents the prediction of \ref{eq_wts_cov_conv}, while the dotted lines represent the theoretical eigenvalues under the detailed balance assumption [\ref{eq_wts_cov_detbal}]. (c) Smallest and largest eigenvalues of $M$ as given by the exact result \ref{eq_wts_cov} (solid lines) and the detailed balance approximation \ref{eq_wts_cov_detbal} (dotted lines), normalized by $\lambda\sigma_\epsilon^2/(2S)$ and for varying sample numbers $P$. The dashed lines represent the smallest and largest absolute values of the eigenvalues of $Q$ [\ref{eq_wts_cov}] in the same normalization. We note that the impact of $Q$ diminishes in the oversampled regime.
  • ...and 4 more figures
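
The matrix $\mathcal{P}=X^+X$ from the caption of Figure 2 can be inspected with a short sketch like the one below. It assumes that $X$ stores one sample per column (shape $N\times P$), so that $\mathcal{P}$ is a $P\times P$ matrix. The excerpt does not state this layout. The function name `sample_projector` and the parameter values are hypothetical.

```python
import numpy as np

def sample_projector(N=50, P=200, seed=0):
    """Return X^+ X for a Gaussian data matrix X of shape (N, P) (assumed layout)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, P))  # entries with variance 1/N
    return np.linalg.pinv(X) @ X                         # Moore-Penrose pseudoinverse times X

proj = sample_projector()
# For P > N and full-rank X, X^+ X is the orthogonal projector onto the row space of X:
print(np.allclose(proj @ proj, proj))   # idempotent
print(np.allclose(proj, proj.T))        # symmetric
print(np.trace(proj))                   # equals the rank N up to rounding
print(np.var(proj))                     # variance of the matrix elements, cf. Figure 2(a)
```

Repeating the last line for several values of `P` at fixed `N` gives a rough numerical counterpart to the curve in panel (a). The scaling exponent quoted in the caption is not verified by this sketch.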