Regularization, early-stopping and dreaming: a Hopfield-like setup to address generalization and overfitting

Elena Agliari, Francesco Alemanno, Miriam Aquaro, Alberto Fachechi

TL;DR

The paper reframes Hopfield-like attractor networks as gradient-based learners that minimize a regularized loss over the interaction matrix $\boldsymbol{J}$. The optimal couplings turn out to be dreaming Hebbian kernels, i.e., Hebbian kernels revised by an unlearning protocol whose extent is quantified by the dreaming time $t_d$ (with $t_d=\epsilon_J^{-1}$). The stationary, fully trained solution yields $\boldsymbol{J}^{(D)}$, while unregularized training stopped at $t^*$ reproduces the dreaming kernel, establishing an equivalence between dreaming and early stopping. Analytical results on random datasets, complemented by numerical experiments on structureless and structured data (e.g., MNIST), uncover regimes of failure, generalization, and overfitting, and highlight the role of spurious intra-class states in enabling robust generalization. The proposed mechanism (coalescence of attractors around the ground truths, yielding wide, class-centered minima) offers a principled way to set the regularization and training-time parameters, and points to extensions to structured data and to a statistical-mechanics interpretation. Overall, the work connects dreaming/unlearning dynamics, regularization, and early stopping to explain and optimize generalization in Hopfield-like networks.

Abstract

In this work we approach attractor neural networks from a machine learning perspective: we look for optimal network parameters by applying gradient descent over a regularized loss function. Within this framework, the optimal neuron-interaction matrices turn out to be a class of matrices which correspond to Hebbian kernels revised by a reiterated unlearning protocol. Remarkably, the extent of such unlearning is proved to be related to the regularization hyperparameter of the loss function and to the training time. Thus, we can design strategies to avoid overfitting that are formulated in terms of regularization and early-stopping tuning. The generalization capabilities of these attractor networks are also investigated: analytical results are obtained for random synthetic datasets; next, the emerging picture is corroborated by numerical experiments that highlight the existence of several regimes (i.e., overfitting, failure and success) as the dataset parameters are varied.
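
The link between the regularization hyperparameter and the extent of unlearning can be illustrated with a minimal sketch. It assumes a ridge-type loss $\frac{1}{2P}\|\boldsymbol{\xi}-\boldsymbol{J}\boldsymbol{\xi}\|^2+\frac{\epsilon_J}{2}\|\boldsymbol{J}\|^2$, which is a stand-in and not necessarily the loss or the normalization used in the paper; all variable names are hypothetical. Under this assumption, plain gradient descent converges to a closed-form kernel that can be rewritten in terms of $t_d=\epsilon_J^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 40, 10                 # neurons and patterns (illustrative sizes)
eps = 0.5                     # assumed regularization strength, role of eps_J
t_d = 1.0 / eps               # dreaming time, using t_d = 1 / eps_J

X = rng.choice([-1.0, 1.0], size=(N, P))   # columns are Rademacher patterns
A = X @ X.T / P                             # N x N pattern correlation matrix

def loss(J):
    # Assumed loss: reconstruction error plus an L2 penalty on the couplings.
    fit = 0.5 * np.sum((X - J @ X) ** 2) / P
    return fit + 0.5 * eps * np.sum(J ** 2)

def grad(J):
    # Gradient of the assumed loss with respect to J.
    return -(A - J @ A) + eps * J

# Step size below the stability threshold set by the largest curvature.
lr = 1.0 / (np.linalg.eigvalsh(A).max() + eps)

J = np.zeros((N, N))
for _ in range(2000):
    J -= lr * grad(J)

# Stationary point of the assumed loss, written in two equivalent ways.
J_closed = A @ np.linalg.inv(A + eps * np.eye(N))
J_dream = t_d * A @ np.linalg.inv(np.eye(N) + t_d * A)

print("max |J_gd - J_closed|   :", np.abs(J - J_closed).max())
print("max |J_closed - J_dream|:", np.abs(J_closed - J_dream).max())
```

In this toy setting a large `eps` (small `t_d`) keeps the kernel close to a rescaled Hebbian matrix, while `eps` going to zero (large `t_d`) pushes it towards the projector onto the span of the patterns.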

Paper Structure (18 sections, 50 equations, 10 figures, 1 table, 2 algorithms)

Figures (10)

  • Figure 1: Retrieval performance of dreaming kernel versus early-stopping. The three panels compare the fully-trained solution, Eq. \ref{eq:J_sol}, with $\epsilon_J \neq 0$ against the solution of the early-stopped training procedure with $\epsilon_J=0$; for the latter, the final training time is chosen according to Eq. \ref{eq:tstar}. In the leftmost panel, the dataset $\boldsymbol{\xi}$ is made of $P$ Rademacher vectors, which naturally have zero mean; in the central and rightmost panels, the dataset $\boldsymbol{\xi}$ is made of $P$ items randomly drawn from the MNIST and Fashion-MNIST datasets, respectively, and binarized by Otsu's method \cite{otsu}. The items in these datasets were used to build the interaction matrices $\boldsymbol{J}^{(D)}$ and $\boldsymbol{J}(t^*)$. For the random dataset $N=200$ and $\gamma=1$, whereas for MNIST and Fashion-MNIST $N=784$ and $\gamma=1$; different values of the ratio $P/N$ are considered, as reported in the common legend. The performance of the system is measured in terms of the normalized Hamming distance $d(\boldsymbol \xi^{\mu}, \boldsymbol{\sigma}^{(\infty)})$ between the target pattern $\boldsymbol \xi^{\mu}$ and the final configuration $\boldsymbol \sigma^{(\infty)}$, obtained by initializing the system in a corrupted version of $\boldsymbol \xi^{\mu}$ (each pattern entry is flipped with probability $q=0.1$) and iterating Eq. \ref{eq:zeronoise} up to convergence. Averaging over all the $P$ patterns gives $\bar{d}(\boldsymbol{\xi}, \boldsymbol{\sigma}^{(\infty)}) = \frac{1}{P}\sum_{\mu} d(\boldsymbol \xi^{\mu}, \boldsymbol{\sigma}^{(\infty)})$, which is plotted versus the dreaming time. We refer to App. \ref{metodi} for further details on numerics.
  • Figure 2: Stopping time as a function of dreaming time. The plot shows the early-stopping time $t^*$ as a function of the dreaming time $t_d$, obtained by a numerical estimate (solid line) from Eq. \ref{eq:tstar} and by a fit (dashed line) based on the functional relation $t^*(t_d)=a \log (1+b\, t_d)$, suggested by the analytical findings presented in App. \ref{sec:approxtstar}. The network parameters for the three cases are $\gamma=1$, $N=784$, and $P/N=0.2$. The pairs of coefficients $(a,b)$ estimated via linear least squares are $(a=0.66, b=0.54)$, $(a=0.11, b=3.25)$ and $(a=0.19, b=1.67)$ for the random, MNIST and Fashion-MNIST datasets, respectively.
  • Figure 3: Schematic representation of training points, spurious combinations and ground-pattern. The figure sketches the organization of attracting configurations within each class in the dataset. The class is represented by a ground-pattern $\boldsymbol{\zeta}$ (the red dot in the center), while the training points are located at distance $(1-r)/2$ from it (i.e., they have correlation $r$ with it). Spurious combinations of training points are themselves attracting points, and their correlation $c_L(r)$ with the ground-pattern increases with the number $L$ of training points involved in the combination. For large enough $M$, the resulting landscape consists of many local minima very close to each other, so that they coalesce and form flat valleys around the ground-pattern.
  • Figure 4: Relaxation to fixed points from perturbed training examples and spurious states. The plots show the retrieval capabilities of the model initialized in a perturbed version of a training example (i.e., $L=1$) or of an intra-class spurious configuration $\boldsymbol{\xi}_L$ (as given by Eq. \ref{eq:spurious}). The network size is fixed to $N=500$, the number of classes is $K=10$, and the quality of the dataset is $r=0.8$, while different values of the dreaming time (from left to right, $t_d=0.1,2,10$) and of the load (from top to bottom, $\alpha=0.4,0.6,0.8$, that is, $M=20,30,40$) are considered. The analysis is performed by taking a reference configuration ${\boldsymbol {\xi}}_L$ (with $L=1, 3, 5, 20$, as explained in the legend) and applying a perturbation that randomly flips a fraction $q$ of its entries; preparing the system in this configuration $\boldsymbol{\sigma}^{(0)}$, we update the network up to convergence towards the fixed point $\boldsymbol{\sigma}^{(\infty)}$. Then, we compare the average distances $\bar{d}(\boldsymbol{\xi}_L, \boldsymbol{\sigma}^{(0)})$ and $\bar{d}(\boldsymbol{\xi}_L, \boldsymbol{\sigma}^{(\infty)})$ between the reference configurations and, respectively, the initial and the final configurations. The dashed black lines correspond to the distance between the training examples used to build $\boldsymbol J^{(D)}$ and the associated ground-truths. The results are averaged over 50 different realizations of the dataset.
  • Figure 5: Retrieval on synthetic and structured datasets. The retrieval performance is measured in terms of the normalized Hamming distance $d$ between the final configuration $\boldsymbol \sigma^{(\infty)}$ and the nearest training example $\boldsymbol{\xi}$ (dotted curve, see Eq. \ref{eq:d_a}) and the nearest ground-truth $\boldsymbol{\zeta}$ (solid curve, see Eq. \ref{eq:d_b}); the results presented have been averaged over the $K\times M$ different initial configurations which constitute the test set (see App. \ref{metodi} for further details). The network parameters for the random dataset are $N=200$, $K=10$ and $r=0.8$, whereas for the structured datasets they are $N=784$ and $K=10$. For all the datasets, we report results for $\alpha=0.1$, $0.2$ and $0.8$ while keeping $\eta=K/N$ fixed; since $\alpha=K M/N$, this means that $\alpha$ is varied by increasing $M$.
  • ...and 5 more figures
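
The retrieval protocol described in the captions above (corrupt a pattern, relax the network, measure the normalized Hamming distance) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the kernel in `dreaming_like_kernel` uses the form $\frac{1}{N}\boldsymbol{\xi}^T (1+t_d)(\boldsymbol{I}+t_d\boldsymbol{C})^{-1}\boldsymbol{\xi}$ as an assumed stand-in for $\boldsymbol{J}^{(D)}$, the zero-noise dynamics is taken to be a parallel sign update, and the sizes, the seed and all function names are hypothetical.

```python
import numpy as np

def make_dataset(N, K, M, r, rng):
    """K Rademacher ground patterns and M noisy examples per class."""
    zeta = rng.choice([-1, 1], size=(K, N))
    # Each entry keeps the ground-pattern sign with probability (1 + r) / 2,
    # so examples sit at normalized Hamming distance (1 - r) / 2 on average.
    chi = rng.choice([1, -1], size=(K, M, N), p=[(1 + r) / 2, (1 - r) / 2])
    xi = (zeta[:, None, :] * chi).reshape(K * M, N)
    return zeta, xi

def dreaming_like_kernel(xi, t_d):
    """Assumed kernel form; rows of xi are the stored examples."""
    P, N = xi.shape
    C = xi @ xi.T / N                                   # P x P overlap matrix
    core = (1 + t_d) * np.linalg.inv(np.eye(P) + t_d * C)
    return xi.T @ core @ xi / N

def corrupt(pattern, q, rng):
    """Flip each entry independently with probability q."""
    flips = rng.random(pattern.shape) < q
    return np.where(flips, -pattern, pattern)

def relax(J, sigma, max_sweeps=100):
    """Parallel sign updates (an assumption), stopped at a fixed point."""
    for _ in range(max_sweeps):
        new = np.where(J @ sigma >= 0, 1, -1)
        if np.array_equal(new, sigma):
            break
        sigma = new
    return sigma

def hamming(a, b):
    """Normalized Hamming distance between two +/-1 vectors."""
    return float(np.mean(a != b))

rng = np.random.default_rng(0)
N, K, M, r, t_d, q = 200, 4, 10, 0.8, 2.0, 0.1
zeta, xi = make_dataset(N, K, M, r, rng)
J = dreaming_like_kernel(xi, t_d)

finals = [relax(J, corrupt(x, q, rng)) for x in xi]
# Distance to the example that was corrupted, and to the nearest ground pattern.
d_train = float(np.mean([hamming(s, x) for s, x in zip(finals, xi)]))
d_truth = float(np.mean([min(hamming(s, z) for z in zeta) for s in finals]))
print(f"mean distance to source example: {d_train:.3f}")
print(f"mean distance to nearest ground pattern: {d_truth:.3f}")
```

Sweeping `t_d` and `M` in this sketch is one way to explore qualitatively how the two distances behave; the numbers it prints are properties of this toy setup only and should not be read as the paper's results.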