Dense Associative Memory Through the Lens of Random Features

Benjamin Hoover, Duen Horng Chau, Hendrik Strobelt, Parikshit Ram, Dmitry Krotov

TL;DR

This work proposes an alternative formulation of Dense Associative Memories using random features. The resulting network closely approximates the energy function and dynamics of conventional Dense Associative Memories and shares their desirable computational properties.

Abstract

Dense Associative Memories are high-storage-capacity variants of Hopfield networks that can store a large number of memory patterns in the weights of a network of a given size. Their common formulations typically store each pattern in a separate set of synaptic weights, so the number of synaptic weights grows as new patterns are introduced. In this work we propose an alternative formulation of this class of models using random features, commonly used in kernel methods. In this formulation the number of the network's parameters remains fixed, while new memories can still be added by modifying the existing weights. We show that this novel network closely approximates the energy function and dynamics of conventional Dense Associative Memories and shares their desirable computational properties.
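The fixed-size idea in the abstract can be illustrated with a minimal sketch. It assumes, as this excerpt does not state explicitly, a log-sum-exp energy $E(\mathbf{x}) = -\frac{1}{\beta}\log\sum_\mu \exp(-\frac{\beta}{2}\lVert\mathbf{x}-\boldsymbol{\xi}^\mu\rVert^2)$ and standard trigonometric random Fourier features for the Gaussian kernel. All function names, the clamp `eps`, and the chosen sizes are hypothetical and are not taken from the paper's code.

```python
import math
import random

def sample_frequencies(num_freqs, dim, beta, rng):
    """Draw frequencies w ~ N(0, beta * I), matching exp(-beta/2 * ||x - y||^2)."""
    sigma = math.sqrt(beta)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_freqs)]

def features(x, freqs):
    """Trigonometric feature map; features(x) . features(y) approximates the kernel."""
    scale = 1.0 / math.sqrt(len(freqs))
    proj = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in freqs]
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]

def add_memory(T, xi, freqs):
    """Store a pattern by adding its features into T; len(T) never changes."""
    return [t + f for t, f in zip(T, features(xi, freqs))]

def exact_energy(x, memories, beta):
    """Assumed energy: touches every stored pattern explicitly."""
    s = sum(math.exp(-0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, xi)))
            for xi in memories)
    return -math.log(s) / beta

def approx_energy(x, T, freqs, beta, eps=1e-6):
    """Random-feature estimate; eps (an assumption) guards against log of a non-positive sum."""
    s = sum(p * t for p, t in zip(features(x, freqs), T))
    return -math.log(max(s, eps)) / beta

rng = random.Random(0)
beta, dim, num_freqs = 1.0, 4, 2000
freqs = sample_frequencies(num_freqs, dim, beta, rng)
memories = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(3)]

T = [0.0] * (2 * num_freqs)       # fixed-size distributed memory
for xi in memories:
    T = add_memory(T, xi, freqs)  # new memories modify existing weights

query = memories[0]
print(exact_energy(query, memories, beta), approx_energy(query, T, freqs, beta))
```

In this sketch the stored state is the single vector `T` of length `2 * num_freqs`, regardless of how many patterns have been added, whereas `exact_energy` needs the full list of memories.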

Paper Structure

This paper contains 26 sections, 11 theorems, 35 equations, 7 figures, 1 algorithm.

Key Result

Proposition 1

With access to the $K$ memories $\{ \boldsymbol{\xi}^\mu \in \mathbb{R}^D, \mu \in \llbracket K \rrbracket \}$, MrDAM takes $O(LKD)$ time and $O(KD)$ peak memory for $L$ energy gradient descent steps (or layers) as defined in \ref{eq:gen-up} with the true energy gradient $\nabla_\mathbf{x} E(\mathbf{x})$.
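A rough sketch of where the $O(LKD)$ cost comes from is given below. The update rule referenced by the proposition is not reproduced in this excerpt, so the sketch assumes plain gradient descent with a fixed step size on the log-sum-exp energy $E(\mathbf{x}) = -\frac{1}{\beta}\log\sum_\mu \exp(-\frac{\beta}{2}\lVert\mathbf{x}-\boldsymbol{\xi}^\mu\rVert^2)$, whose gradient is $\mathbf{x} - \sum_\mu p_\mu \boldsymbol{\xi}^\mu$ with softmax weights $p_\mu$. The function names and the values of `beta`, `num_steps`, and `step_size` are illustrative assumptions.

```python
import math

def softmax_weights(x, memories, beta):
    """Softmax over -beta/2 * squared distances: O(K * D) work."""
    logits = [-0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, xi))
              for xi in memories]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def energy_gradient(x, memories, beta):
    """Gradient of the assumed energy: x minus the softmax-weighted memory average."""
    p = softmax_weights(x, memories, beta)
    recalled = [sum(p_mu * xi[d] for p_mu, xi in zip(p, memories))
                for d in range(len(x))]
    return [x_d - r_d for x_d, r_d in zip(x, recalled)]

def retrieve(x, memories, beta, num_steps, step_size):
    """L gradient steps, each reading all K stored patterns: O(L * K * D) time."""
    for _ in range(num_steps):
        g = energy_gradient(x, memories, beta)
        x = [x_d - step_size * g_d for x_d, g_d in zip(x, g)]
    return x

memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
retrieved = retrieve([0.9, 0.1, 0.0], memories, beta=10.0, num_steps=20, step_size=0.5)
print(retrieved)  # ends close to the first stored pattern
```

Every step recomputes distances to all $K$ patterns of dimension $D$, and the $K \times D$ memory matrix must stay resident, which is consistent with the $O(KD)$ peak memory in the proposition.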

Figures (7)

  • Figure 1: The Distributed Representation for Dense Associative Memory (DrDAM) approximates both the energy and fixed-point dynamics of the traditional Memory Representation for Dense Associative Memory (MrDAM) while having a parameter space of constant size. A) Diagram of DrDAM using a basis function parameterized by random features (e.g., see \ref{eq:rf-trig}). In the distributed representation, adding new memories does not change the size of the memory tensor. B) Comparing energy descent dynamics between DrDAM and MrDAM on 3x64x64 images from Tiny ImageNet \cite{Le2015TinyIV}. Both models are initialized on queries where the bottom two-thirds of pixels are occluded with zeros; dynamics are run while clamping the visible pixels, and the collective energy traces are shown. DrDAM achieves the same fixed points as MrDAM, and these final fixed points have the same energy. The energy decreases with time for both MrDAM and DrDAM, although the path of the energy relaxation towards the fixed point sometimes differs between the two representations. The experimental setup is described in \ref{sec:details-fig1}.
  • Figure 2: DrDAM achieves parameter compression over MrDAM, successfully storing $20$ different 64x64x3 images from Tiny ImageNet \cite{Le2015TinyIV} and retrieving them when the lower 40% of each query is occluded. The memory matrix of MrDAM is of shape $(20, 12288)$ while the memory tensor of DrDAM is of size $Y=2\cdot 10^5$, a ${\sim}20$% reduction in the number of parameters compared to MrDAM; all other configurations for this experiment match those in \ref{sec:details-fig1}. Further compression can be achieved with a higher tolerance for DrDAM's retrieval error, smaller $\beta$, and fewer occluded pixels, see \ref{sec:eval}. Top: Occluded query images. Middle: Fixed-point retrievals from DrDAM. Bottom: (ground truth) Fixed-point retrievals of MrDAM.
  • Figure 3: DrDAM produces better approximations to the energies and gradients of MrDAM when the queries are closer to the stored patterns. Approximation quality improves with larger feature dimension $Y$, but decreases with higher $\beta$ and higher pattern dimension $D$. Approximation error is computed on $500$ stored binary patterns whose entries are normalized to take values in $\{0, \frac{1}{\sqrt{D}}\}$. The Mean Approximation Error (MAE, \ref{eq:MAE}) is taken over $500$ queries initialized: at stored patterns (i.e., queries equal the stored patterns), near stored patterns (i.e., queries equal the stored patterns with 10% of the bits flipped), and randomly (i.e., queries are random and far from stored patterns). Error bars represent the standard error of the mean but are visible only where the approximation is poor. Red horizontal lines represent the expected error of random energies and gradients. The theoretical error upper bounds of \ref{eq:lrate-div-ub} (dark curves on the gradient errors in the right plot only) show a tight fit to empirical results at low $\beta$ and $D$ and are only shown if predictions are "better than random". The shaded area shows the difference between the theoretical bound and the empirical results.
  • Figure 4: A) Retrieval errors predictably follow the approximation quality of \ref{fig:quant1b}. Error is lowest at or near stored patterns but is completely random when energy and gradient approximations are poor, i.e., at high values of $\beta$ and $D$. Note that error improves with increasing $Y$ but follows a different (and noisier) trace than the corresponding approximations for energy and gradient in \ref{fig:quant1b}, because error accumulates over multiple update steps. B) DrDAM's approximation quality improves as $Y$ increases (visible at low $\beta$), but larger values of $Y$ are needed for good approximations to the DAM's fixed points at higher values of $\beta$. (Left) The same corrupted query from CIFAR-10, with the bottom 50% masked, is presented to DAMs with different values of $\beta$. (Middle) The fixed points of DrDAM for each $\beta$ at different sizes $Y$ of the feature space. (Right) The "ground truth" fixed point of MrDAM. The top 50% of pixels are clamped throughout the dynamics.
  • Figure 5: Mean Approximation Error (MAE, \ref{eq:MAE}) increases as the number of stored patterns $K$ increases, keeping $Y=2\cdot 10^5$ constant across all experiments. The exception is random starting positions, where storing more patterns increases the probability that a random query is close to a memory, a regime that leads to more accurate retrievals (see \ref{fig:quant1b}).
  • ...and 2 more figures
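The compression figure quoted in the Figure 2 caption can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones stated in the caption.

```python
# Parameter counts taken from the Figure 2 caption.
num_memories = 20
pattern_dim = 64 * 64 * 3                   # 12288 values per image
mrdam_params = num_memories * pattern_dim   # memory matrix of shape (20, 12288)
drdam_params = 200_000                      # feature dimension Y = 2 * 10^5

reduction = 1.0 - drdam_params / mrdam_params
print(mrdam_params, drdam_params, round(100 * reduction, 1))
# prints: 245760 200000 18.6
```

The exact reduction is about 18.6%, which the caption rounds to roughly 20%.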

Theorems & Definitions (15)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Theorem 1
  • Proposition 5
  • Theorem 2
  • Corollary 1
  • Proof of \ref{thm:kdam-egd-cc}
  • Lemma 1: adapted from Lemma B.1 of \cite{li2023transformers}
  • ...and 5 more