A recurrent vision transformer shows signatures of primate visual attention

Jonathan Morgan, Badr Albanna, James P. Herman

TL;DR

The paper presents a Recurrent Vision Transformer that injects spatial working memory into self-attention to emulate primate visual attention. Trained with sparse reinforcement learning on a cued orientation-change task, the model demonstrates classic attentional benefits, anticipatory memory-guided allocation, and causal-like perturbation effects, closely mirroring primate data. A key finding is that multiplicative memory feedback within a memory-attention loop is essential to reproduce the full spectrum of primate-like attention signatures. This work advances biologically plausible AI by coupling memory, attention, and reward-driven learning, offering a framework to study how perception, memory, and decision-making co-evolve in dynamic environments.

Abstract

Attention is fundamental to both biological and artificial intelligence, yet research on animal attention and AI self-attention remains largely disconnected. We propose a Recurrent Vision Transformer (Recurrent ViT) that integrates self-attention with recurrent memory, allowing both current inputs and stored information to guide attention allocation. Trained solely via sparse reward feedback on a spatially cued orientation-change detection task, a paradigm used in primate studies, our model exhibits primate-like signatures of attention, including improved accuracy and faster responses for cued stimuli that scale with cue validity. Analysis of self-attention maps reveals dynamic spatial prioritization with reactivation prior to expected changes, and targeted perturbations produce performance shifts similar to those observed in primate frontal eye fields and superior colliculus. These findings demonstrate that incorporating recurrent feedback into self-attention can capture key aspects of primate visual attention.

Paper Structure

This paper contains 34 sections, 41 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Each trial in the task comprises seven time steps. In each time step, a $50 \times 50$ grayscale image is input to the model. Black images are shown at $t=0$ and $t=2$. The cue is shown at $t=1$ and can either be at $S_1$ (top left) or $S_4$ (bottom right). The cue can take four configurations, where the portion of the circumference subtended by the ring around the center disk indicates the probability (25%, 50%, 75%, or 100%) that the change will appear at the cued location if the trial is a change trial.
  • Figure 2: Model Schematic. At each timestep a single image is input, parsed into four patches, and passed through a pre-processing stage (see Methods). The resulting low-level visual features, $X^{(t)} = \{x_i^{(t)}\}_{i=1}^{4}$, are combined with the activated memory, $H^{(t-1)} = \{h_i^{(t-1)}\}_{i=1}^{4}$, in a self-attention mechanism, producing spatio-temporal context vectors ($\alpha_i \xi_i$). Context vectors are added to the low-level features and processed together to yield $Z^{(t)} = \{z_i^{(t)}\}_{i=1}^{4}$. The memory, $C^{(t)} = \{c_i^{(t)}\}_{i=1}^{4}$, is then updated using both $Z^{(t)}$ and the previous memory $H^{(t-1)}$. The updated memory $H^{(t)}$ is both fed back into the self-attention mechanism and forward to the RL Agent's actor and critic networks. The actor network uses $H^{(t)}$ to select an action ($\text{"wait"}$ or $\text{"declare change"}$), while the critic network estimates upcoming cumulative rewards. Purple lines indicate weights updated by reward feedback.
  • Figure 3: Panels A--F show the response rates (A--C) and reaction times (D--F) of our agent across cue validities defined with respect to the $S_1$ location, for changes at either the $S_1$ or the $S_4$ position. Each data point is computed over 500 trials, where $\Delta$ specifies the magnitude of the orientation change. The response rate was computed as $n_\text{dc}/n_\text{trials}$, where $n_\text{dc}$ is the number of trials in which the agent selected the action $a^{(t)}=\text{"declare change"}$ and $n_\text{trials}$ is the total number of trials. The reaction times were computed as $\frac{1}{500} \sum_i \tau_{i}$, where $\tau_{i}$ is the time at which trial $i$ ended, either by the agent declaring a change or by waiting through the final timestep. A Response rates for each possible cue condition w.r.t. the $S_1$ position, where changes also occurred at the $S_1$ position. B Response rates computed over trials with a cue at the $S_1$ position, comparing changes at the $S_1$ versus the $S_4$ locations. C Similar to B but with a 100% cue at the $S_1$ location. D--F Same conditions as A--C, showing the mean reaction times.
  • Figure 4: A Averaged self-attention maps at each timestep when $S_1$ is cued and orientation change $\Delta=0$ (no-change trials). Rows correspond to different cue validities (25%, 50%, 75%, 100%), and columns to timesteps $t=0,\ldots, 6$. Darker squares in each $2\times2$ attention map reflect stronger attention. B--C Attentional bias ($\alpha_1^{(t_\text{change})}$) on $S_1$ as a function of orientation change $\Delta$ for each cue validity. The bias $\alpha_1^{(t)}$ is the top-left value from the heatmaps in A. D Attentional bias on $S_1$ (blue: $\alpha_1^{(t)}$) and $S_4$ (red: $\alpha_4^{(t)}$) as a function of timestep when $S_1$ is cued and orientation change $\Delta=0$ (no-change trials).
  • Figure 5: Plots showing the effect of artificially modulating the bias. Each data point is an average over 500 trials. Artificial modulation involves inducing a high bias in a single spatial region (increasing the value of one of the patches in the self-attention maps from Figure 4G). In all cases, the bias is induced with respect to the $S_1$ ($\alpha_1^{(t)}$) or the $S_4$ ($\alpha_4^{(t)}$) stimulus location. If $\alpha_i^{(t_{\text{change}})}=1$, the transmission $Z^{(t_{\text{change}})}$ has been completely biased toward $\xi_i^{(t_{\text{change}})}$. A--C show the effects of this manipulation on the response rates and D--F on the reaction times, each plotted against the orientation change $\Delta$.
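
The two behavioural measures defined in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration of the formulas $n_\text{dc}/n_\text{trials}$ and $\frac{1}{500}\sum_i \tau_i$. The function names, the action strings used as constants, and the four-trial toy data are assumptions made for the example and do not come from the paper's code.

```python
from statistics import mean

# Assumed action labels, mirroring the two actions named in the captions.
DECLARE_CHANGE = "declare change"
WAIT = "wait"


def response_rate(final_actions):
    """Fraction of trials ending in 'declare change' (n_dc / n_trials)."""
    n_trials = len(final_actions)
    n_dc = sum(1 for action in final_actions if action == DECLARE_CHANGE)
    return n_dc / n_trials


def mean_reaction_time(end_times):
    """Average of the trial end times tau_i over all trials."""
    return mean(end_times)


# Hypothetical toy data: four trials rather than the 500 used per data point.
# A trial ends when the agent declares a change or after the last timestep (t = 6).
final_actions = [DECLARE_CHANGE, WAIT, DECLARE_CHANGE, DECLARE_CHANGE]
end_times = [4, 6, 5, 4]

print(response_rate(final_actions))   # 0.75
print(mean_reaction_time(end_times))  # 4.75
```

Note that, as in the caption, a trial in which the agent waits through the final timestep still contributes its end time to the mean reaction time.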
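
The bias manipulation described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that the four attention weights $\alpha_1,\ldots,\alpha_4$ are non-negative and sum to 1, and that when one weight is raised the remaining weights are rescaled proportionally. The caption does not specify how the other weights are treated, so that rescaling rule, the function name `induce_bias`, and the example weights are all hypothetical.

```python
def induce_bias(alpha, index, value):
    """Set alpha[index] to `value` and rescale the other weights to keep the sum at 1.

    The proportional rescaling is an assumption made for this sketch.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    others = [a for j, a in enumerate(alpha) if j != index]
    rest = sum(others)
    if rest > 0:
        # Shrink (or grow) the remaining weights proportionally.
        scale = (1.0 - value) / rest
        rescaled = [a * scale for a in others]
    else:
        # No mass on the other patches: spread the remainder uniformly.
        rescaled = [(1.0 - value) / len(others)] * len(others)
    result = list(rescaled)
    result.insert(index, value)
    return result


# Hypothetical attention weights over the four patches (S_1 is index 0).
alpha = [0.4, 0.2, 0.2, 0.2]

# alpha_1 = 1: attention is fully biased toward the S_1 patch.
print(induce_bias(alpha, 0, 1.0))  # [1.0, 0.0, 0.0, 0.0]

# A partial bias toward S_4 (index 3); the weights still sum to 1.
partial = induce_bias(alpha, 3, 0.7)
```

With `value=1.0` every other weight becomes zero, which corresponds to the limiting case in the caption where $Z^{(t_\text{change})}$ is completely biased toward $\xi_i^{(t_\text{change})}$.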