The importance of being empty: a spectral approach to Hopfield neural networks with diluted examples

Elena Agliari, Alberto Fachechi, Domenico Luongo

TL;DR

It is demonstrated that the Hebbian matrix, built on sparse examples, can be recovered as the fixed point of a gradient descent algorithm with dropout, over a suitable loss function.

Abstract

We consider Hopfield networks, where neurons interact pairwise by Hebbian couplings built over (i) a set of definite patterns (ground truths), (ii) a sample of labeled examples (supervised setting), (iii) a sample of unlabeled examples (unsupervised setting). We focus on the case where ground truths are Rademacher vectors and examples are noisy versions of these ground truths, possibly displaying some blank entries (e.g., mimicking missing or dropped data), and we determine the spectral distribution of the coupling matrices in the three scenarios, by exploiting and extending the Marchenko-Pastur theorem. By leveraging this knowledge, we are able to analytically inspect the stability and attractiveness of the ground truths, as well as the generalization capabilities of the networks. In particular, as corroborated by long-running Monte Carlo simulations, the presence of blank entries can have benefits in some specific conditions, suggesting strategies based on data sparsification; the robustness of these results in structured datasets is confirmed numerically. Finally, we demonstrate that the Hebbian matrix, built on sparse examples, can be recovered as the fixed point of a gradient descent algorithm with dropout, over a suitable loss function.
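The three coupling matrices described in the abstract can be illustrated with a minimal NumPy sketch. The data model and normalizations below are assumptions made for illustration, not taken from the paper: each example entry equals the ground-truth entry with probability $(1+r)/2$ and is flipped otherwise, each entry is then blanked (set to zero) independently with probability $d$, and the function names are hypothetical.

```python
import numpy as np

def make_dataset(N, K, M, r, d, rng):
    """Return ground truths xi (K, N) and diluted noisy examples eta (K, M, N)."""
    xi = rng.choice([-1, 1], size=(K, N))                        # Rademacher ground truths
    chi = np.where(rng.random((K, M, N)) < (1 + r) / 2, 1, -1)   # +1 keeps the entry, -1 flips it
    keep = rng.random((K, M, N)) >= d                            # False marks a blank entry
    eta = xi[:, None, :] * chi * keep                            # entries in {-1, 0, +1}
    return xi, eta

def hebbian_ground_truth(xi):
    """Hebbian couplings built directly on the ground truths."""
    K, N = xi.shape
    return xi.T @ xi / N

def hebbian_supervised(eta):
    """Labels known: average the examples of each class first (assumed normalization)."""
    K, M, N = eta.shape
    centroids = eta.mean(axis=1)                                 # shape (K, N)
    return centroids.T @ centroids / N

def hebbian_unsupervised(eta):
    """Labels unknown: sum outer products over all examples (assumed normalization)."""
    K, M, N = eta.shape
    flat = eta.reshape(K * M, N)
    return flat.T @ flat / (N * M)

rng = np.random.default_rng(0)
N, K, M, r, d = 200, 5, 20, 0.8, 0.3
xi, eta = make_dataset(N, K, M, r, d, rng)
J_gt = hebbian_ground_truth(xi)
J_sup = hebbian_supervised(eta)
J_unsup = hebbian_unsupervised(eta)
```

Under these assumptions, the average overlap between an example entry and the corresponding ground-truth entry is $r(1-d)$, so dilution lowers the signal carried by each example while also changing the noise statistics of the couplings.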

Paper Structure

This paper contains 25 sections, 3 theorems, 91 equations, 22 figures.

Key Result

Proposition 1

Let $\boldsymbol{J}$ be a coupling matrix and $\mu_{\boldsymbol{J}}$ its empirical spectral distribution. Let $MP(\alpha,\alpha\sigma,s)$ be the modified Marchenko-Pastur distribution, with $\lambda_\pm = \sigma (1\pm \sqrt{\alpha})^2+s$. Then, in the thermodynamic limit $N \to \infty$:

Figures (22)

  • Figure 1: Empirical distributions for the local generalization in the unsupervised setting. The plots show the empirical distributions for the 1-step local generalization in the unsupervised setting for $K=100$ and $M=50$ (first row) and $K=400$ and $M=500$ (second row), with the quality fixed to $r=0.9$. The values of the dilution are $d=0$ (first column), $d=0.2$ (second column) and $d=0.5$ (third column). Specifically, the main plots report the histograms of the 1-step local generalization, including or removing the diagonal from the coupling matrix (yellow and blue histograms, respectively). The blue and red dotted curves are Gaussian distributions whose parameters are, respectively, fitted to the data (blue curve) or estimated from the theoretical predictions of the spectral theory as given by Prop. \ref{attractiveness_prop}. Finally, the insets report a similar comparison performed over the (again empirical, fitted and theoretical) cumulative distribution functions. In this case, the setting without the diagonal is not reported.
  • Figure 2: Generalization in the supervised and unsupervised setting. The plots show a comparison between the numerical estimates (dots) for the quantity $m^{(1)}(\boldsymbol \xi, r)$ at a dilution level $d$ and the theoretical expressions (solid curves) for the best-fitting value $d^*$ (obtained by comparing the predictions in Prop. \ref{attractiveness_prop} with the numerical findings) in the supervised (first row) and unsupervised (second row) setting. For the sake of readability, in the lower right plot we report only the best-fitting ($R^2\approx 1$) values $d^*$ of the dilution parameter, which always refer (from top to bottom) to $d=0,0.1,0.2,0.3$ in the numerical simulations. Numerical results are averaged over $100$ different realizations with $N=500$ and $M=100$ fixed. We compare results for $\alpha=0.1$ (left) and $\alpha=0.2$ (right). For the latter, we also report a zoom on high values of the dataset quality ($r\simeq 1$, inset plot). Again, error bars are not reported, due to the low magnitude of the relative errors of the numerical simulations.
  • Figure 3: A summary picture of the generalization capacity in the unsupervised setting. In the first row, we report the behavior of the 1-step Mattis magnetization as a function of the dilution parameter $d$ for various values of $r$ and $\alpha=0.1$ (left) and $\alpha=0.4$ (right). The inset in the upper right plot highlights the behavior of $\tilde{d}$ (i.e., the value of the dilution parameter at which, for given $r$, the 1-step magnetization develops a maximum) as a function of the dataset quality, for different values of the load $\alpha$ (from right to left, $\alpha=0.1,0.2,0.3,0.4,0.5,0.6$). In the second row, we report the heat maps for the probability that the attractiveness of the ground truth is positive, namely $\mathcal{P}(\Delta\ge 0) = \frac{1}{2} (1+m^{(1)})$ (see also App. \ref{app:snr}), again for $\alpha=0.1$ (left) and $\alpha=0.4$ (right). In the lower right plot, the inset reports a zoom on the portion of the parameter plane $(r,d)\in [0.9,1]\times[0,1]$ to highlight the non-monotonic behavior of the 1-step magnetization as a function of $d$. In all the plots, the number of examples per class is fixed to $M=100$.
  • Figure 4: Pattern stability as a function of the dilution in the unsupervised setting. The plots report the behavior of the final magnetization $m_f$ of the neural dynamics \ref{eq:dynamics} with the initial condition being the ground truth ($\boldsymbol{\sigma}^{(0)}=\boldsymbol{\xi}^\mu$). We report the dependence of $m_f$ on the dilution parameter $d$ for various values of $K$ ($50$, left; $250$, center; $500$, right), the dataset quality $r$ ($0.6$, first row; $0.8$, second row; $0.9$, third row), and the number of examples per class $M$ ($10$, blue square dots; $20$, yellow triangles; $200$, green circles). The network size is fixed to $N=1000$. The results are averaged over 500 different realizations of the coupling matrix for each point.
  • Figure 5: Generalization in the unsupervised setting. The plots report the results for the pattern attractiveness, namely the final magnetization $m_f$ under the relaxation to fixed points of the neural dynamics \ref{eq:dynamics} starting from a validation example (with initial magnetization $m_0 (r)$). We report the dependence of $m_f$ on the initial overlap $m_0 (r)$ for various values of $K$ ($50$, first row; $200$, second row; $250$, third row), the number of examples per class $M$ ($10$, left; $50$, center; $200$, right), and the dilution parameter $d$ ($0$, blue squares; $0.5$, yellow squares). The network size is fixed to $N=1000$. The results are averaged over 500 different realizations of the coupling matrix for each point.
  • ...and 17 more figures
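
The magnetizations quoted in the captions above can be illustrated with a short sketch. The dynamics referenced as \ref{eq:dynamics} is not reproduced in this summary, so the code assumes the standard zero-temperature parallel update $\boldsymbol\sigma \to \mathrm{sign}(\boldsymbol J \boldsymbol\sigma)$ and uses plain Hebbian couplings built on the ground truths; the helper names are hypothetical. It also checks the relation $\mathcal{P}(\Delta\ge 0) = \frac{1}{2}(1+m^{(1)})$ from Figure 3 in its empirical form, reading $\Delta$ as the per-neuron alignment after one step (an assumption): for $\pm 1$ vectors, the fraction of neurons aligned with the ground truth equals $(1+m^{(1)})/2$.

```python
import numpy as np

def sign_pm(x):
    """Sign function mapping zero to +1, so that states stay in {-1, +1}."""
    return np.where(x >= 0, 1, -1)

def relax(J, sigma0, max_sweeps=100):
    """Iterate the assumed parallel update until a fixed point or max_sweeps."""
    sigma = sigma0.copy()
    for _ in range(max_sweeps):
        new = sign_pm(J @ sigma)
        if np.array_equal(new, sigma):
            break
        sigma = new
    return sigma

rng = np.random.default_rng(1)
N, K, r = 200, 5, 0.6
xi = rng.choice([-1, 1], size=(K, N))          # Rademacher ground truths
J = xi.T @ xi / N                              # Hebbian couplings (diagonal kept)

# Noisy initial state: each entry of xi[0] is flipped with probability (1 - r) / 2
sigma0 = xi[0] * np.where(rng.random(N) < (1 + r) / 2, 1, -1)
m0 = np.mean(xi[0] * sigma0)                   # initial magnetization, close to r

sigma1 = sign_pm(J @ sigma0)
m1 = np.mean(xi[0] * sigma1)                   # 1-step Mattis magnetization
frac_aligned = np.mean(xi[0] * sigma1 > 0)     # fraction of neurons aligned after one step

sigma_f = relax(J, sigma0)
m_f = np.mean(xi[0] * sigma_f)                 # final magnetization

print(f"m0={m0:.3f}  m1={m1:.3f}  m_f={m_f:.3f}")
```

At this low load ($K/N = 0.025$) the noisy state is expected to relax close to the ground truth; the figures explore how this changes with $K$, $M$, $r$ and $d$.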

Theorems & Definitions (15)

  • Definition 1: Gaussian approximation
  • Remark 1
  • Remark 2
  • Definition 2: Approximate Factorization Method
  • Proposition 1
  • Remark 3
  • Proposition 2
  • Theorem 1: Marchenko-Pastur Theorem
  • Proof of Prop. \ref{prop:allspectra}, first point
  • Proof of Prop. \ref{prop:allspectra}, second point
  • ...and 5 more
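
As a numerical illustration of the Marchenko-Pastur theorem listed above, the following sketch compares the spectrum of a Hebbian matrix built on Rademacher ground truths with the edges $\lambda_\pm = \sigma(1\pm\sqrt{\alpha})^2 + s$. Taking $\alpha = K/N$, $\sigma = 1$ and $s = 0$ for this undiluted case is an assumption made here, corresponding to the classical Marchenko-Pastur law; the modified distribution of the paper is not reproduced in this summary. The $K \times K$ Gram matrix is used because it shares its nonzero eigenvalues with the $N \times N$ coupling matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2000, 400
alpha = K / N                                   # load, 0.2 here

xi = rng.choice([-1, 1], size=(K, N))           # Rademacher ground truths
gram = xi @ xi.T / N                            # K x K, same nonzero spectrum as xi.T @ xi / N
eigs = np.linalg.eigvalsh(gram)

sigma, s = 1.0, 0.0                             # assumed values for the undiluted case
lam_minus = sigma * (1 - np.sqrt(alpha)) ** 2 + s
lam_plus = sigma * (1 + np.sqrt(alpha)) ** 2 + s

print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"predicted edges: [{lam_minus:.3f}, {lam_plus:.3f}]")
```

At finite $N$ the extreme eigenvalues only approach the predicted edges, so a small mismatch is expected.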