Training neural networks with structured noise improves classification and generalization

Marco Benedetti, Enrico Ventura

TL;DR

This work investigates how structured noise during training can improve classification and generalization in attractor neural networks. By reinterpreting Training with Noise (TWN) through the Wong–Sherrington loss $\mathcal L(m,J)$, it shows that maximal noise with a specific structure can drive the network toward an SVM-like solution, while unstructured maximal noise converges to Hebbian Unlearning. It proves that Hebbian Unlearning and Training with Noise coincide in the maximal-noise regime when the training data are stable fixed points of the dynamics. The key condition is that the weights $\omega_i^{\mu}$ give negative contributions near zero stability. The results motivate a two-phase learning scenario resembling sleep-based consolidation, bridging unsupervised and supervised paradigms in neural computation.

Abstract

The beneficial role of noise injection in learning is a well-established concept in the field of artificial neural networks, suggesting that even biological systems might take advantage of similar mechanisms to optimize their performance. The training-with-noise algorithm proposed by Gardner and collaborators is an emblematic example of a noise-injection procedure in recurrent networks, which can be used to model biological neural systems. We show how adding structure to noisy training data can substantially improve the algorithm's performance, allowing the network to approach perfect retrieval of the memories and wide basins of attraction, even in the scenario of maximal injected noise. We also prove that the so-called Hebbian Unlearning rule coincides with the training-with-noise algorithm when noise is maximal and data are stable fixed points of the network dynamics.
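To make the two update rules named in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the perceptron-like form of the training-with-noise step, the tie convention at zero field, the zero-diagonal constraint, and all function names (`noisy_copy`, `twn_step`, `hebbian_unlearning_step`) are assumptions that do not come from this page.

```python
import numpy as np

def noisy_copy(xi, m_t, rng):
    """Flip each spin independently so the expected overlap with xi is m_t."""
    flip = rng.random(xi.shape) < (1.0 - m_t) / 2.0
    return np.where(flip, -xi, xi)

def twn_step(J, xi, m_t, lam, rng):
    """One training-with-noise update on memory xi (assumed perceptron-like form)."""
    N = xi.size
    S = noisy_copy(xi, m_t, rng)          # noisy training configuration
    h = J @ S                             # local fields produced by the noisy input
    eps = (xi * h <= 0).astype(float)     # 1 where neuron i fails to recover xi_i (ties count as errors: an assumption)
    J = J + (lam / N) * np.outer(eps * xi, S)
    np.fill_diagonal(J, 0.0)              # no self-couplings (assumption)
    return J

def hebbian_unlearning_step(J, lam, rng, max_sweeps=100):
    """One unlearning update: relax a random state to a fixed point, then subtract it."""
    N = J.shape[0]
    S = rng.choice([-1.0, 1.0], size=N)
    for _ in range(max_sweeps):           # zero-temperature asynchronous dynamics
        changed = False
        for i in rng.permutation(N):
            s_new = 1.0 if J[i] @ S >= 0 else -1.0
            if s_new != S[i]:
                S[i] = s_new
                changed = True
        if not changed:                   # stable fixed point reached
            break
    J = J - (lam / N) * np.outer(S, S)
    np.fill_diagonal(J, 0.0)
    return J
```

In this sketch, `m_t = 1` feeds the memory itself and `m_t` close to `0` corresponds to the maximal-noise regime discussed above. Both functions return a new coupling matrix and leave their input untouched.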

Paper Structure (20 sections, 50 equations, 10 figures)


Figures (10)

  • Figure 1: Lines in the main plot show $\mathcal{L}(m = 0.5, J(m_t))$ for different training overlaps $m_t$ as a function of the number of algorithm steps $d$. The dashed line represents the theoretical minimum value from Wong and Sherrington (1993). All measurements are averaged over $5$ realizations of the couplings $J$. Choice of the parameters: $N = 100$, $\alpha = 0.3$, $\lambda = 1$; the initial couplings are Gaussian with zero mean and unit variance, and $J_{ii}^{(0)} = 0$ $\forall i$.
  • Figure 2: $m_f(1)$ as a function of $m_t$ and $\alpha$. Warmer shades of color are associated with higher retrieval performance. The black dashed line represents the boundary of the retrieval regime according to the criterion in Appendix C; white dots mark the points where the basins of attraction of the memories are larger than those obtained from an SVM at $N = 200$.
  • Figure 3: Distribution of $\omega_i^{\mu}$ as a function of $\Delta_i^{\mu}$ for training configurations sampled by Monte Carlo on a Hebbian network at temperature $T = 0$, i.e. stable fixed points only (a), $T = 0.5$ (b), $T = 1$ (c), and $T = 8$ (d). Warmer colors represent denser regions of data points. The full black line is the non-weighted best-fit line for the points, the dotted white line represents $\omega = 0$, and the red dot is the value of the best-fit line at $\Delta = 0$. The sub-panel of each panel shows a zoom of the line around $\Delta = 0$. Measurements have been collected over $15$ samples of the network, and observations show that finite-size effects are negligible. Choice of the parameters: $N = 500$, $\alpha = 0.5$.
  • Figure 4: (a), (b): Distribution of $\omega_i^{\mu}$ as a function of $\Delta_i^{\mu}$ for training configurations sampled by Monte Carlo on an SK model at temperature $T = 0$, i.e. stable fixed points only (a), and $T = 8$ (b). Warmer colors represent denser regions of data points. The full black line is the non-weighted best-fit line for the points, the dotted white line represents $\omega = 0$, and the red dot is the value of the best-fit line at $\Delta = 0$. The sub-panel of each panel shows a zoom of the line around $\Delta = 0$. (c), (d): Comparison between the Hebbian and the random initialization through: the Pearson coefficient between $\omega_i^{\mu}$ and $\Delta_i^{\mu}$ (c) and the estimated value of $\omega_{emp}(0)$ from the dispersion plots (d). Measurements have been collected over $15$ samples of the network. Choice of the parameters: $N = 500$, $\alpha = 0.5$.
  • Figure 5: The TWN algorithm is implemented by sampling stable fixed points of the network dynamics with $m_t = 0^+$. (a) The empirical measure of $\omega_i^{\mu}$ around $\Delta_i^\mu = 0$ for the case of stable fixed points, as a function of the rescaled number of iterations of the learning algorithm. Error bars are given by the standard deviations of the measurements. (b) Pearson coefficient measured between $\omega_i^{\mu}$ and $\Delta_i^{\mu}$. (c) The standard deviation of the couplings during learning, defined as $\sigma = \frac{1}{N}\sum_i\sigma_i$. Points are averaged over $50$ samples and the choice of the parameters is: $N = 100$, $\lambda = 10^{-2}$.
  • ...and 5 more figures
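
The captions above plot quantities against the stabilities $\Delta_i^{\mu}$ of a Hebbian network, but this page does not define them. The sketch below assumes the standard normalized form $\Delta_i^{\mu} = \xi_i^{\mu} \sum_j J_{ij}\xi_j^{\mu} / (\sqrt{N}\,\sigma_i)$ with $\sigma_i = \sqrt{\sum_j J_{ij}^2 / N}$. That definition and the helper names `hebbian_couplings` and `stabilities` are assumptions made for illustration. The weights $\omega_i^{\mu}$ are not computed, because their definition is not given here.

```python
import numpy as np

def hebbian_couplings(xi):
    """xi: (P, N) array of +/-1 memories. Returns the Hebbian matrix with zero diagonal."""
    P, N = xi.shape
    J = xi.T @ xi / N
    np.fill_diagonal(J, 0.0)
    return J

def stabilities(J, xi):
    """Return Delta[mu, i] under the assumed normalized-stability definition."""
    N = J.shape[0]
    sigma = np.sqrt((J ** 2).sum(axis=1) / N)   # per-neuron coupling scale sigma_i
    fields = xi @ J.T                           # fields[mu, i] = sum_j J_ij * xi_j^mu
    return xi * fields / (np.sqrt(N) * sigma)

# Example with the parameters quoted in the captions: N = 500, alpha = P / N = 0.5.
rng = np.random.default_rng(0)
N, P = 500, 250
xi = rng.choice([-1.0, 1.0], size=(P, N))
delta = stabilities(hebbian_couplings(xi), xi)
frac_stable = (delta > 0).mean()                # fraction of (mu, i) pairs with positive stability
```

Under this definition, a memory is a fixed point of the zero-temperature dynamics exactly when all of its stabilities are positive, which is why the captions focus on the behavior near $\Delta = 0$.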