Controlled Langevin Dynamics for Sampling of Feedforward Neural Networks Trained with Minibatches

Alessandro Zambon, Francesca Caruso, Riccardo Zecchina, Guido Tiana

Abstract

Sampling the parameter space of artificial neural networks according to a Boltzmann distribution provides insight into the geometry of low-loss solutions and offers an alternative to conventional loss minimization for training. However, exact sampling methods such as hybrid Monte Carlo (hMC), while formally correct, become computationally prohibitive for realistic datasets because they require repeated evaluation of full-batch gradients. We introduce a pseudo-Langevin (pL) dynamics that enables efficient Boltzmann sampling of feed-forward neural networks trained with large datasets by using minibatches in a controlled manner. The method exploits the statistical properties of minibatch gradient noise and adjusts fictitious masses and friction coefficients to ensure that the induced stochastic process efficiently samples the desired equilibrium distribution. We validate the approach numerically by comparing its equilibrium statistics with those obtained from exact hMC sampling. Performance benchmarks demonstrate that, while hMC rapidly becomes inefficient as network size increases, the pL scheme maintains high computational diffusion and scales favorably to networks with over one million parameters. Finally, we show that sampling at intermediate temperatures yields optimal generalization performance, comparable to SGD, without requiring a validation set or an early-stopping procedure. These results establish controlled minibatch Langevin dynamics as a practical and scalable tool for exploring and exploiting the solution space of large neural networks.
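
To make the notion of Boltzmann sampling with masses and friction concrete, the following is a minimal sketch of ordinary underdamped Langevin dynamics on a toy one-dimensional quadratic loss. It is not the pL scheme of the paper: it uses the exact gradient rather than minibatches, and the function name `langevin_sample`, the BAOAB splitting, the loss $\mathcal{L}(w) = k w^2/2$ and all parameter values are assumptions chosen for illustration. For this loss, the Boltzmann distribution $\propto e^{-\mathcal{L}(w)/T}$ is Gaussian with variance $T/k$, which the sampled variance should approximate.

```python
import math
import random


def langevin_sample(grad, w0, temperature, mass=1.0, friction=1.0,
                    dt=0.05, n_steps=200_000, burn_in=10_000, seed=0):
    """Underdamped Langevin dynamics (BAOAB splitting) for a scalar weight w.

    Hypothetical helper for illustration only; `grad` is the exact gradient
    of the loss, not a minibatch estimate.
    """
    rng = random.Random(seed)
    w, p = w0, 0.0
    # Coefficients of the exact Ornstein-Uhlenbeck update for the momentum.
    c1 = math.exp(-friction * dt)
    c2 = math.sqrt((1.0 - c1 * c1) * mass * temperature)
    samples = []
    g = grad(w)
    for step in range(n_steps):
        p -= 0.5 * dt * g                       # B: half kick from the loss gradient
        w += 0.5 * dt * p / mass                # A: half drift
        p = c1 * p + c2 * rng.gauss(0.0, 1.0)   # O: friction plus thermal noise
        w += 0.5 * dt * p / mass                # A: half drift
        g = grad(w)
        p -= 0.5 * dt * g                       # B: half kick
        if step >= burn_in:
            samples.append(w)
    return samples


# Toy loss L(w) = k * w**2 / 2, so grad L(w) = k * w (assumed example values).
k, T = 2.0, 0.5
samples = langevin_sample(lambda w: k * w, w0=1.0, temperature=T)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sampled variance {variance:.3f}, Boltzmann prediction {T / k:.3f}")
```

In this toy setting the only noise source is the thermostat. Replacing `grad` with a minibatch estimate would add a second, uncontrolled noise source, which is the situation the abstract says the pL scheme handles by adjusting masses and friction coefficients.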

Paper Structure

This paper contains 11 sections, 30 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Sketch of the feed-forward ANN used in this work, with (from left to right) $L_1$, $L_2$ and $L_3$ neurons in the first, second and third layers, respectively. To the left and right of the network are an example input and output pair: a $d$-dimensional spin vector and a normalized probability distribution over the $K=L_3$ classes.
  • Figure 2: Scheme of the pseudo-Langevin (pL) algorithm.
  • Figure 3: a) The distribution $\rho$ of the components of the mini-batch noise $\boldsymbol{\mathcal{R}}^{(b)}_{\tau} \left(\boldsymbol{w}(t)\right)$ during a pL simulation as a function of time $t$ (blue curves), compared to the expected distribution $\mathcal{N}(0,1)$ (orange curves). The gray planes indicate the times at which the values of the mini-batch noise variance matrix $\mathcal{V}_{\tau}$ are updated. b) The distribution $\rho$ of the elements of the correlation matrix $C_{ij}$ during a pL simulation as a function of time $t$, computed among the components of the mini-batch noise $\boldsymbol{\mathcal{R}}^{(b)}_{\tau} \left(\boldsymbol{w}(t)\right)$ (blue curves), of the white noise $\boldsymbol{\mathcal{R}}(t)$ (orange curves) and of the weighted sum of the two (green curves). The curves are slightly shifted along the z-axis to make them distinguishable. c) The average of the component-wise noise autocorrelation $\overline{\chi}(\Delta t)$ for the mini-batch noise (upper panel, blue curves), for the white noise (middle panel, orange curves) and for their weighted sum (lower panel, green curves), computed for small time intervals during a pL simulation parametrized by the time $t$. d) The average of the mean cross-entropy $\mathcal{L}$ (upper plot), the squared norm $|\boldsymbol{w}|^2$ (center plot) and the training error $\epsilon$ (lower plot) as a function of the temperature $T$, obtained from hMC (blue curves) and pL (orange curves) for the smallest model, $N=11160$. The relative error between the estimates is also reported (gray dotted line).
  • Figure 4: In the upper plot, the diffusion coefficient $D^*$ scaled by the network dimension $N$ as a function of $N$, for the three studied sizes $N=11160$, $N=101610$ and $N=1006110$ and for both sampling methods hMC (blue curve) and pL (orange curve). In the lower plots, the mean squared distance $\overline{d^{2}}$ scaled by the network dimension $N$ as a function of wall-clock time interval $\Delta t_W$ between sampled vectors, for the three studied sizes $N=11160$, $N=101610$ and $N=1006110$ and for both sampling methods hMC (blue curve) and pL (orange curve). The mean squared distances are obtained by averaging the values from independent simulations starting from different equilibrated weight vectors.
  • Figure 5: The mean value and the standard deviation of the generalization error $\epsilon_{g}$ (a), the mean cross-entropy function $\mathcal{L}$ (b) and the average of all the variances of the mini-batch gradient components in the first two layers $\overline{\mathcal{V}_\tau}$ (c), sampled at equilibrium at different temperatures $T$ using the pL scheme. In the inset, the generalization error $\epsilon_{g}$ as a function of the wall-clock time $t_W$ during two simulations starting from initialized models at two different temperatures, $T=1.0\cdot10^{-7}$ (orange curve) and $T=3.0\cdot10^{-7}$ (green curve), and during an Adam training continued beyond early stopping (red dotted curve). The gray straight lines in the upper plot and in the inset represent the best mean generalization error found with Adam training. All values of $\epsilon_g$ have been computed on the same test dataset.
  • ...and 6 more figures