
Deep sequence models tend to memorize geometrically; it is unclear why

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

TL;DR

The paper identifies geometric memory as a global, synthesized embedding geometry that encodes multi-hop relations, challenging the prevailing associative-memory narrative. By designing an in-weights path-star task, it shows that deep sequence models can memorize graphs in their parameters and use geometry to convert long $\ell$-hop reasoning into simple navigational steps, even without global supervision. A central contribution is linking this geometry to spectral biases, notably those seen in Node2Vec, and showing that cross-entropy loss naturally induces a low-rank, globally structured representation that traverses beyond local co-occurrences. The findings raise fundamental questions about how associative and geometric memories compete during training and suggest practical avenues to steer Transformer memory toward more geometric representations, with broad implications for knowledge acquisition, retrieval, and unlearning.

Abstract

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.
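
As a rough illustration of how a spectral view can turn purely local edges into global geometry, the sketch below computes a one-dimensional spectral embedding of a small path graph. This is a textbook graph-Laplacian computation, not the paper's Node2Vec analysis or any trained model; the function names (`path_laplacian`, `fiedler_vector`) and the choice of an 8-node path are assumptions made for this example. Only adjacent nodes share an edge, yet the resulting coordinates order all nodes by their position along the path, so nodes that never co-occur still end up in a meaningful global arrangement.

```python
import math

def path_laplacian(n):
    """Laplacian L = D - A of the path graph 0 - 1 - ... - (n-1)."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def fiedler_vector(L, iters=2000):
    """Eigenvector of the smallest nonzero Laplacian eigenvalue, found by
    power iteration on (c*I - L) with the constant vector projected out."""
    n = len(L)
    c = 2.0 * max(L[i][i] for i in range(n))  # upper bound on eigenvalues of L
    v = [float(i) for i in range(n)]          # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]             # remove the trivial constant mode
        v = [c * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

n = 8  # illustrative size; only pairs (i, i+1) are ever "seen" as edges
coords = fiedler_vector(path_laplacian(n))
for node, x in enumerate(coords):
    print(node, round(x, 3))
```

In this toy setting, the endpoints of the path are the farthest apart in the embedding even though no edge connects them, which is the kind of global structure the abstract contrasts with a lookup of local associations.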

Paper Structure

This paper contains 6 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Associative vs. geometric memory of models trained on various graphs. There are two dramatically different ways to memorize a dataset of atomic facts. The common view is of associative memory: entities are embedded arbitrarily, and co-occurrences are stored in weight matrices (left). §\ref{sec:competing-views}: In practice, we find a geometric memory: the learned embeddings of a Transformer (middle) reflect global structure inferred from the local co-occurrences in training data. §\ref{sec:geometry-spectral-bias}: When associative memory is explicitly prohibited (by removing intermediate layers), as in a Node2Vec model (right), a more elegant geometry materializes. This points to a clear headroom to improve the geometric nature of a Transformer's memory. Details of the Transformer architecture used for this visualization are provided in §\ref{app:tiny-model-architecture}. Similar geometries for Mamba and neural networks are presented in §\ref{sec:tiny-graphs}.
  • Figure 2: Overview of the in-context path-star task of Bachmann & Nagarajan (2024). Each training and test example corresponds to a fresh, randomly labeled path-star graph (a tree graph where only the root node branches into $d$ paths of length $\ell$). For each example, the prefix specifies a randomized adjacency list (of edge bigrams) of the corresponding graph, followed by $(v_{\mathtt{root}}, v_{\mathtt{goal}})$. The target is the full path $(v_{\mathtt{root}} \to v_{\mathtt{goal}})$ in that graph.
  • Figure 3: Overview of our in-weights path-star task. All examples are derived from a fixed path-star graph. Training involves two types of examples: (i) edge memorization examples; (ii) path-finding examples, where the prefix is some leaf, and the target is the full path. Test examples are path examples corresponding to a held-out set of leaves.
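
The two task setups described in Figures 2 and 3 can be sketched as small data generators. This is a minimal sketch under stated assumptions, not the authors' data pipeline: the function names, the integer node labels, the reading of "length $\ell$" as $\ell$ nodes beyond the root, the choice of a leaf as the goal, and the (prefix, target) format of the edge memorization examples are all hypothetical choices made for illustration.

```python
import random

def make_path_star(d, ell, rng):
    """Randomly labeled path-star graph: the root branches into d paths.
    Assumption: each path has ell nodes beyond the root.
    Returns (root, arms); arms[i] lists labels from the root's child to the leaf."""
    labels = list(range(1 + d * ell))
    rng.shuffle(labels)
    root, rest = labels[0], labels[1:]
    arms = [rest[i * ell:(i + 1) * ell] for i in range(d)]
    return root, arms

def edges_of(root, arms):
    """All (parent, child) edges of the path-star graph."""
    edges = []
    for arm in arms:
        path = [root] + arm
        edges.extend(zip(path, path[1:]))
    return edges

def in_context_example(d, ell, rng):
    """Figure 2 style: a fresh graph per example. The prefix is a shuffled
    edge list (as bigrams) followed by (root, goal); the target is the path."""
    root, arms = make_path_star(d, ell, rng)
    edges = edges_of(root, arms)
    rng.shuffle(edges)
    goal_arm = rng.choice(arms)               # assumption: the goal is a leaf
    prefix = [tok for edge in edges for tok in edge] + [root, goal_arm[-1]]
    target = [root] + goal_arm
    return prefix, target

def in_weights_dataset(d, ell, n_test_leaves, rng):
    """Figure 3 style: one fixed graph for all examples. Returns edge
    memorization examples, training path examples, and held-out path examples."""
    root, arms = make_path_star(d, ell, rng)
    edge_examples = [([u], [v]) for u, v in edges_of(root, arms)]  # hypothetical format
    path_examples = [([arm[-1]], [root] + arm) for arm in arms]    # leaf -> full path
    rng.shuffle(path_examples)
    test_paths = path_examples[:n_test_leaves]
    train_paths = path_examples[n_test_leaves:]
    return edge_examples, train_paths, test_paths

rng = random.Random(0)
prefix, target = in_context_example(d=3, ell=4, rng=rng)
edge_examples, train_paths, test_paths = in_weights_dataset(
    d=5, ell=4, n_test_leaves=2, rng=rng)
print(len(prefix), target)
print(len(edge_examples), len(train_paths), len(test_paths))
```

The key difference the sketch tries to make visible is where the graph lives: in the in-context generator every example carries its own edge list in the prefix, whereas in the in-weights generator the edges appear only as separate training examples, so a path-finding prefix contains nothing but a leaf.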