Neural Computation Without Slots: Steps Towards Biologically Plausible Memory and Attention in Natural and Artificial Intelligence

Shaunak Bhandarkar, James L. McClelland

TL;DR

This work argues for memory and attention without explicit slots by extending modern Hopfield networks to a fixed-capacity, sparse, distributed memory (the K-winner MHN). It then demonstrates how MHN-based components can approximate slot-based attention in a minimal transformer by storing keys and values in fast, connection-weighted memories and learning slow representations for queries and values. Across unstructured and structured memory patterns, the K-winner MHN shows enhanced retention of older memories with only modest costs to initial retrieval, and in the transformer setting the QK-MHN-Transformer and related variants can achieve perfect in-context learning on a Case Sequence Task, with semantically meaningful structuring emerging in the learned weight matrices. The results provide a principled, biologically plausible bridge between slot-based AI mechanisms and distributed, weight-based memory, with implications for memory, attention, and continual learning in both brains and AI systems.

Abstract

Many models used in artificial intelligence and cognitive science rely on multi-element patterns stored in "slots" - dedicated storage locations - in a digital computer. As biological brains likely lack slots, we consider how they might achieve similar functional outcomes without them by building on the neurally-inspired modern Hopfield network (MHN; Krotov & Hopfield, 2021), which stores patterns in the connection weights of an individual neuron. We propose extensions of this approach to increase its biological plausibility as a model of memory and to capture an important advantage of slot-based computation in contemporary language models. For memory, neuroscience research suggests that the weights of overlapping sparse ensembles of neurons, rather than a dedicated individual neuron, are used to store a memory. We introduce the K-winner MHN, extending the approach to ensembles, and find that within a continual learning regime, the ensemble-based MHN exhibits greater retention of older memories, as measured by the graded sensitivity measure d', than a standard (one-neuron) MHN. Next, we consider the powerful use of slot-based memory in contemporary language models. These models use slots to store long sequences of past inputs and their learned encodings, supporting later predictions and allowing error signals to be transported backward in time to adjust weights underlying the learned encodings of these past inputs. Inspired by these models' successes, we show how the MHN can be extended to capture both of these important functional outcomes. Collectively, our modeling approaches constitute steps towards understanding how biologically plausible mechanisms can support computations that have enabled AI systems to capture human-like abilities that no prior models have been able to achieve.
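
The abstract reports retention using the graded sensitivity measure d'. This excerpt does not give the exact formula, so the following is only a minimal sketch under an assumption: it uses the common form of d' for graded scores, the difference between the mean score for stored memories and the mean score for untrained patterns, divided by the root of the average of the two sample variances. The function name and the example scores are hypothetical.

```python
import math
from statistics import mean, variance


def d_prime(signal_scores, noise_scores):
    """Graded d' (assumed form): separation of two score distributions.

    signal_scores: retrieval scores for patterns the network was trained on.
    noise_scores:  retrieval scores for untrained (pseudo-memory) patterns.
    """
    mu_signal = mean(signal_scores)
    mu_noise = mean(noise_scores)
    # Equal-weight pooled standard deviation of the two distributions.
    pooled_sd = math.sqrt(0.5 * (variance(signal_scores) + variance(noise_scores)))
    return (mu_signal - mu_noise) / pooled_sd


# Hypothetical scores, for illustration only (not data from the paper).
trained = [0.9, 0.8, 0.7]
untrained = [0.3, 0.2, 0.1]
sensitivity = d_prime(trained, untrained)
```

Under this form, a larger d' means the network separates stored memories from untrained patterns more reliably, which is how the abstract compares the K-winner MHN with the standard MHN at different memory ages.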

Paper Structure

This paper contains 56 sections, 7 theorems, 106 equations, 20 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose $M$ is a slot-based (localist) MHN with input size $n_v$, input sparsity level $s_v$, and (sufficiently large) hidden size $n_h$, feedforward function $f$, and that $M$ has been trained on $N$ patterns where $N \to \infty$. Assume additionally that there exists a positive constant $c_0 << s_ for any $\epsilon \in (0, 1)$; moreover, if it holds that $c > \theta$, it follows that Here, $k_v

Figures (20)

  • Figure 1: An illustration conceptualizing the equivalence between softmax-weighted associative memory retrieval and MHN-based retrieval, in two main settings. A. The original auto-associative MHN, as conceived by Krotov and Hopfield (2021). For a given queried input, its dot products with predetermined vectors $x_1, \dots, x_{n_h}$ are computed, and the resulting softmaxed dot products are used to return a weighted combination of these same vectors. Such a computation may be realized within an autoencoder architecture with bidirectional connection weights storing the $x_i$'s. Performing multiple cycles of retrieval enables stable retrieval of the best-matching $x_i$. B. A hetero-associative MHN that stores pairs of vectors---keys and values---and that uses keys to retrieve associated values. Crucially, instead of storing keys and values as neural activity states, they may be encoded in the incoming and outgoing weights of single neurons, respectively. The resulting network's computation coincides with that of the transformer self-attention mechanism. In contrast to the auto-associative MHN, this network is feedforward, i.e., it cannot be run for multiple cycles. In A and B, hypothetical input/output neuron and "memory" neuron activations are shown in gray and red, respectively.
  • Figure 2: Backpropagation of gradients through an aggregation of slots effectively constitutes backpropagation through time---a ubiquitous issue among recurrent neural architectures more broadly. A. The backpropagation algorithm in a transformer self-attention head that processes the input sequence $x_1, \dots, x_T$ followed by the query $x_q$ for which the ground truth label is $y$. Gradients (shown via red arrows) backpropagate through the self-attention computation and iteratively produce gradient updates (i.e., for the key and value weights) that consist of an outer product between the gradient of a given key $k_t$ or value $v_t$ with the current context item $x_t$; see the SI Appendix for a detailed breakdown. These updates are then summed over the entire context window to produce the total gradient update for a given learnable weight matrix (e.g., $W_K$, $W_V$). B. The backpropagation through time (BPTT) algorithm in a standard recurrent neural network (RNN), where the hidden state is given recursively as $h_t = \sigma\left( W_{HH}h_{t-1} + W_{HI}x_t \right)$. For any learnable weight matrix, gradient updates for that matrix are iteratively computed for earlier time steps (again, as outer products between gradients of receiving units and activity patterns of sending units), and these updates are summed to produce the total weight update.
  • Figure 3: A comparison between the modern Hopfield network (MHN) and its variants. In contrast to the original MHN (left), K-winner MHNs (middle and right) are able to learn through weight updates as each input pattern is observed. Moreover, in the K-winner MHN, only weights projecting into and out of the $k$ hidden units with the largest values are updated (shown in green). In contrast to the 1-winner MHN (middle), general K-winner MHNs (right) allow for distributed hidden state representations, graded weight updates, and sparse network connectivity.
  • Figure 4: A. Comparison of a candidate K-winner MHN's and the original MHN's retrieval ability for memories of different ages, relative to their untrained (pseudo-memory) baselines, for both 100% cues (left) and 50% cues (right). Results were averaged over $100$ independent runs of each model. B. Retrieval sensitivity $d'$ for the given K-winner MHN and original MHN using 100% cues (left) and 50% cues (right). Cyan: $d'$ standard error. Horizontal segments show ages where K-winner MHN $d'$ is higher (red) and MHN $d'$ is higher (black), with uncorrected $p < 0.01$.
  • Figure 5: A. A visual description of the Case Sequence Task for a sample input sequence. B. A diagram of the "minimal" transformer architecture, in which the context window consists of two inputs (for simplicity). The dot products between the embedded query $q$ and the contextual keys $k_t$ are computed, and the softmax of these values is computed by applying an exponential nonlinearity to each $k_t^T q$ term and subsequently normalizing (illustrated in the purple shading). The resulting attention scores are used to modulate the linear combination of the values $v_t$ that is produced as the output. The supervisory training signal for gradient descent only arrives at this final output.
  • ...and 15 more figures
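
The equivalence described in the captions of Figures 1B and 5B can be illustrated with a small sketch. It compares a slot-based softmax attention readout with a feedforward network in which each hidden unit holds one key in its incoming weights and one value in its outgoing weights. All function names, shapes, and numbers below are illustrative assumptions; this is not the authors' implementation, and it omits learning entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention(query, keys, values):
    """Slot-based view: keys and values are kept as explicit stored vectors."""
    weights = softmax([dot(k, query) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def mhn_retrieve(query, w_in, w_out):
    """Weight-based view: row i of w_in is the incoming weight vector of
    hidden unit i (a key); column i of w_out is its outgoing weight vector
    (a value). Hidden activations are a softmax over the unit inputs."""
    hidden = softmax([dot(row, query) for row in w_in])
    return [dot(out_row, hidden) for out_row in w_out]


# Hypothetical key/value pairs and query, for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, -1.0]

w_in = keys                                   # shape: (n_hidden, key_dim)
w_out = [list(col) for col in zip(*values)]   # shape: (value_dim, n_hidden)

out_slots = slot_attention(query, keys, values)
out_mhn = mhn_retrieve(query, w_in, w_out)
```

Both readouts compute the same softmax-weighted combination of values; the only difference is whether the keys and values live in a list of slots or in the weights of hidden units.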

Theorems & Definitions (14)

  • Theorem 1
  • Proposition 2
  • Remark 3
  • proof
  • Theorem 4
  • Proposition 5
  • Remark 6
  • proof
  • proof: Proof of Theorem \ref{thm:mhn_theory}
  • Theorem 7
  • ...and 4 more