Determinantal Point Process Attention Over Grid Cell Code Supports Out of Distribution Generalization

Shanka Subhra Mondal; Steven Frankland; Taylor Webb; Jonathan D. Cohen

Determinantal Point Process Attention Over Grid Cell Code Supports Out of Distribution Generalization

Shanka Subhra Mondal, Steven Frankland, Taylor Webb, Jonathan D. Cohen

TL;DR

This work tackles OOD generalization by combining brain-inspired grid cell representations with a Determinantal Point Process (DPP) based attention (DPP-A). The authors propose a two-component framework where grid-cell embeddings encode relational structure and DPP-A gates emphasize diverse, high-variance, low-redundancy grid-cell features, guided by a training objective that pairs a task loss with the DPP objective. Across analogy and arithmetic tasks, and across LSTM and Transformer inference modules, the approach yields strong, often near-perfect, OOD generalization in translation and scale, outperforming temporal-context normalization, dropout, L1 regularization, and non-DPP baselines. The results suggest a biologically plausible path toward robust generalization in neural networks and motivate future work on nonlinear spaces, broader datasets, and biologically plausible implementations of DPP-like selection mechanisms.

Abstract

Deep neural networks have made tremendous gains in emulating human-like intelligence, and have been used increasingly as ways of understanding how the brain may solve the complex computational problems on which this relies. However, these still fall short of, and therefore fail to provide insight into how the brain supports strong forms of generalization of which humans are capable. One such case is out-of-distribution (OOD) generalization-successful performance on test examples that lie outside the distribution of the training set. Here, we identify properties of processing in the brain that may contribute to this ability. We describe a two-part algorithm that draws on specific features of neural computation to achieve OOD generalization, and provide a proof of concept by evaluating performance on two challenging cognitive tasks. First we draw on the fact that the mammalian brain represents metric spaces using grid cell code (e.g., in the entorhinal cortex): abstract representations of relational structure, organized in recurring motifs that cover the representational space. Second, we propose an attentional mechanism that operates over the grid cell code using Determinantal Point Process (DPP), that we call DPP attention (DPP-A) -- a transformation that ensures maximum sparseness in the coverage of that space. We show that a loss function that combines standard task-optimized error with DPP-A can exploit the recurring motifs in the grid cell code, and can be integrated with common architectures to achieve strong OOD generalization performance on analogy and arithmetic tasks. This provides both an interpretation of how the grid cell code in the mammalian brain may contribute to generalization performance, and at the same time a potential means for improving such capabilities in artificial neural networks.

Determinantal Point Process Attention Over Grid Cell Code Supports Out of Distribution Generalization

TL;DR

Abstract

Paper Structure (29 sections, 1 theorem, 14 equations, 21 figures, 1 algorithm)

This paper contains 29 sections, 1 theorem, 14 equations, 21 figures, 1 algorithm.

Introduction
Approach
Task setup
Analogy task
Arithmetic task
Architecture
Grid cell code
DPP-A
Inference module
Related work
Experiments
Experimental details
Comparison models
Results
Analogy
...and 14 more sections

Key Result

Theorem 2.1

For a positive semidefinite matrix $\bm{V}$ and $\bm{w} \in [0,1]^N$:

Figures (21)

Figure 1: Schematic of the overall framework. Given a task (e.g., an analogy to solve), inputs (denoted as $\{A, B, C, D\}$) are represented by the grid cell code, consisting of units ("grid cells") representing different combinations of frequencies and phases. Grid cell embeddings ($\bm{x}_A, \bm{x}_B, \bm{x}_C, \bm{x}_D$) are multiplied elementwise (represented as a Hadamard product $\odot$) by a set of learned attention gates $\bm{g}$, then passed to the inference module $\bm{R}$. The attention gates $\bm{g}$ are optimized using $\mathcal{L}_{DPP}$, which encourages attention to grid cell embeddings that maximize the volume of the representational space. The inference module outputs a score for each candidate analogy (consisting of $A, B, C$ and a candidate answer choice $D$). The scores for all answer choices are passed through a softmax to generate an answer $\hat{y}$, which is compared against the target $y$ to generate the task loss $\mathcal{L}_{task}$.
Figure 2: Generation of test analogies from training analogies (region marked in blue) by: a) translating both dimension values of $A, B, C, D$ by the same amount; and b) scaling both dimension values of $A, B, C, D$ by the same amount. Since both dimension values are transformed by the same amount, each input gets transformed along the diagonal.
Figure 3: Results on analogy on each region for translation and scaling using LSTM in the inference module.
Figure 4: Results on analogy on each region for translation and scaling using the transformer in the inference module.
Figure 5: Results on arithmetic on each region using LSTM in the inference module.
...and 16 more figures

Theorems & Definitions (1)

Theorem 2.1

Determinantal Point Process Attention Over Grid Cell Code Supports Out of Distribution Generalization

TL;DR

Abstract

Determinantal Point Process Attention Over Grid Cell Code Supports Out of Distribution Generalization

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (21)

Theorems & Definitions (1)