Modern Hopfield Networks meet Encoded Neural Representations -- Addressing Practical Considerations

Satyananda Kashyap, Niharika S. D'Souza, Luyao Shi, Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood

TL;DR

Experimental results demonstrate a substantial reduction in meta-stable states and increased storage capacity, while still enabling perfect recall of a significantly larger number of inputs, advancing the practical utility of associative memory networks for real-world tasks.

Abstract

Content-addressable memories such as Modern Hopfield Networks (MHN) have been studied as mathematical models of auto-association and storage/retrieval in human declarative memory, yet their practical use for large-scale content storage faces challenges. Chief among them is the occurrence of meta-stable states, particularly when handling large amounts of high-dimensional content. This paper introduces Hopfield Encoding Networks (HEN), a framework that integrates encoded neural representations into MHNs to improve pattern separability and reduce meta-stable states. We show that HEN can also be used for retrieval in the context of hetero-association of images with natural language queries, thus removing the limitation of requiring access to partial content in the same domain. Experimental results demonstrate a substantial reduction in meta-stable states and increased storage capacity, while still enabling perfect recall of a significantly larger number of inputs, advancing the practical utility of associative memory networks for real-world tasks.
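As a rough illustration of the retrieval step the abstract refers to, the sketch below implements the widely used softmax update for Modern Hopfield Networks, with either a dot-product or a negative squared $\ell_2$ similarity. It is a minimal sketch and not the authors' implementation: the function name `mhn_retrieve`, the random unit vectors standing in for encoder outputs, and the values of `beta` and `steps` are all assumptions made for this example. In HEN, the stored vectors would instead be the outputs of a pretrained encoder.

```python
import numpy as np

def mhn_retrieve(memories, query, beta=50.0, steps=5, similarity="dot"):
    """Iterate the softmax-based Hopfield update on a single query.

    memories: (N, d) array, one stored pattern per row.
    query:    (d,) array, e.g. a partially occluded pattern.
    """
    state = query.astype(float)
    for _ in range(steps):
        if similarity == "dot":
            scores = memories @ state
        else:  # negative squared l2 distance as the similarity
            scores = -np.sum((memories - state) ** 2, axis=1)
        # Numerically stable softmax over the stored patterns.
        weights = np.exp(beta * (scores - scores.max()))
        weights /= weights.sum()
        # New state is a convex combination of the stored patterns.
        state = weights @ memories
    return state

# Hypothetical stand-in for encoded memories: random unit vectors.
rng = np.random.default_rng(0)
memories = rng.standard_normal((20, 64))
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

# Occlude half of one stored pattern and use it as the query.
query = memories[3].copy()
query[32:] = 0.0

recalled_dot = mhn_retrieve(memories, query, similarity="dot")
recalled_l2 = mhn_retrieve(memories, query, similarity="l2")
best_match = int(np.argmax(memories @ recalled_dot))
```

When the stored vectors are well separated, as these random vectors are, the state collapses onto a single memory. When many stored vectors are highly similar, the softmax weights spread over several memories and the state settles on a blend of them, which is the meta-stable behavior the abstract describes.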

Paper Structure (14 sections, 11 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Progression of the Modern Hopfield Network (MHN) run on image inputs (top row) and of the HEN (bottom row) at different intermediate steps. The sequence from left to right is as follows: the original image, the query image with half of it occluded, an intermediate update at iteration 11, and the final reconstruction at iteration 150. The HEN uses an $\ell_2$ similarity and a discrete Variational Autoencoder (D-VAE) encoder. We set $\beta=150$ in both experiments.
  • Figure 2: Memory recall performance of various encoder methods and the image-based Modern Hopfield Network, each color-coded differently. The left panel plots the MSE, while the right panel plots 1-SSIM, both as a function of $\beta$. The encoder-based HEN methods outperform the raw image-based method over a very large range of hyperparameter choices. In the legend, Dot or L2 denotes the dot-product or $\ell_2$ similarity, followed by the type of representation used: Image (original input), dVAE (Discrete Variational Autoencoder), the Kullback-Leibler (KL)-based variants KLf8 and KLf16, and the vector-quantized VAE variants VQf8 and VQf16, per the convention in rombach2021high.
  • Figure 3: Illustrating separability in the various embeddings used in Subsection \ref{Encoded}. We plot the histogram of the distribution of cosine similarity values between the queries and memories, i.e., $\text{cos}(\hat{\mathbf{S}}_{i}^{(0)},\hat{\mathbf{\xi}}_{j})$. Distributions colored indicate self-similarity (i.e., $i=j$) across paired examples, while distributions colored in blue indicate cross-similarities (i.e., $i \neq j$). We generate these distributions for (a) raw images (black box), diffusion models trained (b) with KL divergence (orange box) and (c) with vector quantization (brown box), and (d) the Discrete VAE (D-VAE) (red box).
  • Figure 4: HEN architecture for Natural language based hetero-associations. Orange Box: Paired text and image inputs are fed to text ($\mathbf{\Phi}^{\mathbf{T}}_{\text{enc}}(\cdot)$) and image ($\mathbf{\Phi}^{\mathbf{I}}_{\text{enc}}(\cdot)$) encoders respectively to generate the memories of the HEN memory bank. Grey Box: At query time, a partial query $\hat{\mathbf{s}}^{(0)}$ is generated by feeding the query text into $\mathbf{\Phi}^{\mathbf{T}}_{\text{enc}}(\cdot)$. After convergence, the image encoding is extracted from $\hat{\mathbf{s}}^{(T_{f})}$, and decoded through the Image decoder ($\mathbf{\Phi}^{\mathbf{I}}_{\text{dec}}(\cdot)$) to retrieve the corresponding image. Using full image representations instead of image encodings implies $\mathbf{\Phi}^{\mathbf{I}}_{\text{enc}}(\cdot) = \mathbf{\Phi}^{\mathbf{I}}_{\text{dec}}(\cdot) = \mathcal{I}_{K}$, the identity transformation. In the experiment where the text captions are pixelized as input, $\mathbf{\Phi}^{\mathbf{T}}_{\text{enc}}(\cdot) = \mathbf{\Phi}^{\mathbf{I}}_{\text{enc}}(\cdot)$.
  • Figure 5: Step-by-step progression of a cross-modal query using two different encodings for associated visual and language cues. Top Row: The text stimulus is encoded via the CLIP radford2021learning text encoder and associated with the image represented by a D-VAE encoded vector. Bottom Row: The reconstruction process for hetero-association. (L-R) Ground truth; iteration $t=0$, starting with a blank canvas and the provided CLIP-encoded text inputs as query prompts; an intermediate update; and the full reconstruction. This demonstrates the network's ability to accurately reconstruct the image from a text-only input drawn from a completely different stimulus space than the image content.
  • ...and 3 more figures
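
The hetero-association flow described in the Figure 4 caption can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the paper's code: it assumes each memory is the concatenation of a text code and an image code, that the query fills the image slot with zeros, and it uses random unit vectors in place of real encoder outputs (for example CLIP text embeddings and D-VAE image codes). The names `retrieve` and `recall_image_code` and all dimensions are hypothetical.

```python
import numpy as np

def retrieve(memories, state, beta, steps=5):
    """Softmax-based Hopfield update with dot-product similarity."""
    for _ in range(steps):
        scores = beta * (memories @ state)
        weights = np.exp(scores - scores.max())  # stable softmax
        weights /= weights.sum()
        state = weights @ memories
    return state

rng = np.random.default_rng(1)
n_pairs, d_text, d_image = 10, 32, 48

# Hypothetical stand-ins for the text and image encoder outputs.
text_codes = rng.standard_normal((n_pairs, d_text))
text_codes /= np.linalg.norm(text_codes, axis=1, keepdims=True)
image_codes = rng.standard_normal((n_pairs, d_image))
image_codes /= np.linalg.norm(image_codes, axis=1, keepdims=True)

# Assumption: each memory stores the paired codes side by side.
memories = np.concatenate([text_codes, image_codes], axis=1)

def recall_image_code(text_code, beta=50.0):
    # Partial query: the text code is known, the image slot is blank.
    query = np.concatenate([text_code, np.zeros(d_image)])
    final_state = retrieve(memories, query, beta)
    # Only the image part of the converged state is kept.
    return final_state[d_text:]

recovered = recall_image_code(text_codes[4])
```

In the full pipeline described in the caption, the recovered image code would then be passed through the image decoder to produce the retrieved image; that decoding step is omitted here because it depends on the specific pretrained model.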