Provably Optimal Memory Capacity for Modern Hopfield Models: Transformer-Compatible Dense Associative Memories as Spherical Codes

Jerry Yao-Chieh Hu, Dennis Wu, Han Liu

TL;DR

The paper establishes a provable, tight memory-capacity bound for kernelized modern Hopfield models by recasting stored memories as a spherical code on the unit sphere. It shows that maximal capacity is achieved when memories form an optimal spherical code and proves an exponential-in-$D_\Phi$ scaling of capacity, matching known lower bounds with a corresponding upper bound in the low-temperature regime. A sublinear-time algorithm, U-Hop+, is proposed to find a suitable learnable feature map $\Phi$ that achieves this optimal capacity, with convergence guarantees as temperature tends to zero. The authors validate the theory with experiments showing reduced metastable states, improved energy landscapes, and gains in multiple-instance learning tasks, while also analyzing the dimension-demand relation and practical implications for transformer-compatible memory layers.

Abstract

We study the optimal memorization capacity of modern Hopfield models and Kernelized Hopfield Models (KHMs), a transformer-compatible class of Dense Associative Memories. We present a tight analysis by establishing a connection between the memory configuration of KHMs and spherical codes from information theory. Specifically, we treat the stored memory set as a specialized spherical code. This enables us to cast the memorization problem in KHMs into a point arrangement problem on a hypersphere. We show that the optimal capacity of KHMs occurs when the feature space allows memories to form an optimal spherical code. This unique perspective leads to: (i) An analysis of how KHMs achieve optimal memory capacity, together with the corresponding necessary conditions. Importantly, we establish an upper capacity bound that matches the well-known exponential lower bound in the literature. This provides the first tight and optimal asymptotic memory capacity for modern Hopfield models. (ii) A sub-linear time algorithm, $\mathtt{U}\text{-}\mathtt{Hop}$+, to reach KHMs' optimal capacity. (iii) An analysis of the scaling behavior of the required feature dimension relative to the number of stored memories. These efforts improve both the retrieval capability of KHMs and the representation learning of corresponding transformers. Experimentally, we provide thorough numerical results to back up theoretical findings.
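To make the retrieval setting in the abstract concrete, the following is a minimal sketch of one kernelized retrieval step. It assumes the standard softmax update of modern Hopfield models (the new state is a softmax-weighted average of the stored memories) and further assumes that a KHM computes the similarity scores in a feature space $\Phi$. The names `retrieve`, `feature_map`, and `beta` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(memories, query, beta=1.0, feature_map=None, steps=1):
    """Run `steps` softmax retrieval updates.

    Assumption: similarities are inner products in feature space,
    <Phi(memory), Phi(x)>, and the new state is the softmax-weighted
    average of the raw memories. With feature_map=None, Phi is the
    identity and this reduces to the usual modern Hopfield update.
    """
    phi = feature_map if feature_map is not None else (lambda v: v)
    feats = [phi(m) for m in memories]
    x = list(query)
    for _ in range(steps):
        fx = phi(x)
        weights = softmax([beta * dot(f, fx) for f in feats])
        x = [sum(w * m[i] for w, m in zip(weights, memories))
             for i in range(len(x))]
    return x


# Two orthogonal unit memories; a noisy query near the first one.
memories = [[1.0, 0.0], [0.0, 1.0]]
retrieved = retrieve(memories, [0.9, 0.1], beta=50.0)
```

With a large inverse temperature `beta` (the low-temperature regime), the softmax weights concentrate on the memory closest to the query, so `retrieved` lies very close to the first memory.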

Paper Structure

This paper contains 46 sections, 13 theorems, 72 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Lemma 2.1

Let $1 - p$ be the probability of successfully storing and retrieving a pattern. Assuming the patterns are normalized, the number of patterns $M_\Phi$ that can be stored and retrieved by the KHM, following the KHM update rule, is lower-bounded by: where $C$ is the solution to $C = b / W_0(\exp\{a + \ln b\})$, with $W_0(\cdot)$ being the principal branch of the Lambert $W$ function, $a \co
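The constant $C$ in the lemma is defined through the principal branch of the Lambert $W$ function. As a small numerical sketch, the code below evaluates $C = b / W_0(\exp\{a + \ln b\})$ for given $a$ and $b$. Because the definitions of $a$ and $b$ are cut off in the excerpt above, they are treated here as plain positive inputs; `lambert_w0` and `capacity_constant` are hypothetical helper names, and the Newton iteration is one simple way to evaluate $W_0$ for non-negative arguments.

```python
import math


def lambert_w0(z, tol=1e-12, max_iter=100):
    """Principal branch W_0(z) for z >= 0, i.e. the w >= 0 with w*exp(w) = z.

    Uses Newton's method started at log(1 + z), which lies at or above
    the root for z >= 0, so the iterates decrease monotonically.
    """
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def capacity_constant(a, b):
    """Evaluate C = b / W_0(exp(a + ln b)); a and b are caller-supplied."""
    if b <= 0:
        raise ValueError("b must be positive so that ln(b) is defined")
    return b / lambert_w0(math.exp(a + math.log(b)))


# Sanity check: for a = b = 1 the argument is e, W_0(e) = 1, so C = 1.
c_example = capacity_constant(1.0, 1.0)
```

For large $a$, `math.exp(a + math.log(b))` can overflow; a more careful version would solve $w + \ln w = a + \ln b$ directly in log space.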

Figures (5)

  • Figure 1: Energy Landscape under Different Iterations of Algorithm 1. Left: $M=2$; right: $M=4$. Lighter color represents higher energy. The first row shows the raw energy landscape without applying $\mathtt{U}\text{-}\mathtt{Hop}$+. The second through last rows show the energy landscape for $N = 1, 2, 5$. The visualization shows that Algorithm 1 not only separates the local minima better but also pushes memories closer to the fixed points.
  • Figure 2: Basins of Attraction Comparison of Algorithm 1. The first row shows the raw basins of attraction without applying $\mathtt{U}\text{-}\mathtt{Hop}$+ or KHM. The second through last rows show the basins for $N = 1, 2, 5$. Square points are memories. White areas mark queries that fail to converge to a single memory; colored areas mark queries that converge to the corresponding memory. The result indicates that $\mathtt{U}\text{-}\mathtt{Hop}$+ converges to fixed points quickly and reduces metastable states. $1$-entmax and $2$-entmax correspond to Softmax [ramsauer2020hopfield] and Sparsemax [hu2023SparseHopfield], respectively.
  • Figure 3: Separation Bound Numerical Simulation. We visualize the bound presented in the separation-bound lemma in 3-D. The bound becomes tighter as the number of points increases.
  • Figure 4: Assignment Problem in 2D. We observe that the learned feature map consistently places similar pairs closer to each other, which preserves some semantic information.
  • Figure 5: Loss Curve of $\mathcal{L}$ for different memory set sizes. We run separation maximization for 100 epochs on MNIST under two settings, $M=100$ and $M=200$, with $\tau=0.1$, learning rate 1e-3, and $D_\Phi=100$. The result shows that $\mathcal{L}$ converges quickly, which echoes our sub-linear time complexity.

Theorems & Definitions (34)

  • Definition 2.1: Generalized Fixed Point [sriperumbudur2009convergence]
  • Remark 1
  • Definition 2.2: Pattern Storage and Retrieval
  • Lemma 2.1: Memory Capacity of KHM
  • Proof
  • Definition 2.3: Spherical Code
  • Definition 2.4: Minimal Separation
  • Definition 2.5: Optimal Spherical Code
  • Definition 2.6
  • Definition 2.7: Kernelized Well-Separation Condition [wu2024uniform, wu2023stanhop, hu2023SparseHopfield, ramsauer2020hopfield]
  • ...and 24 more
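
The definitions of spherical code and minimal separation listed above are not reproduced on this page. As a rough illustration, the sketch below assumes the common convention that a spherical code is a finite set of unit vectors and that its minimal separation is the smallest pairwise angle (equivalently, the largest pairwise inner product). The function name `min_separation_angle` is hypothetical, and the paper's exact definition may differ, for example by using Euclidean distance or inner products directly.

```python
import itertools
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def min_separation_angle(points):
    """Smallest pairwise angle (in radians) among the normalized points.

    Assumed convention: a larger value means the points are spread
    more evenly on the sphere, so memories are easier to tell apart.
    """
    unit = [normalize(p) for p in points]
    max_cos = max(dot(u, v) for u, v in itertools.combinations(unit, 2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    max_cos = max(-1.0, min(1.0, max_cos))
    return math.acos(max_cos)


# Four points spaced evenly on the unit circle are 90 degrees apart.
square = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
angle_square = min_separation_angle(square)

# Three evenly spaced points on the circle are 120 degrees apart.
h = math.sqrt(3.0) / 2.0
triangle = [[1.0, 0.0], [-0.5, h], [-0.5, -h]]
angle_triangle = min_separation_angle(triangle)
```

Under this convention, an optimal spherical code for a given number of points and dimension is one that maximizes the returned angle, which matches the point-arrangement view described in the abstract.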