Transformer brain encoders explain human high-level visual responses

Hossein Adeli, Sun Minni, Nikolaus Kriegeskorte

TL;DR

The paper tackles how retinotopic visual information is routed to high-level category-selective brain regions during natural viewing. It introduces a transformer-based brain encoder that uses cross-attention to dynamically gate information from retinotopic maps to ROI queries, outperforming traditional encoders across multiple backbones and modalities. Key contributions include demonstrating superior predictive accuracy, providing interpretable attention maps that reveal content-driven routing, and showing robustness to training data size and backbone choice. The work offers a mechanistic perspective on how visual information may be gated and expanded within the cortex, with implications for brain-inspired encoding and interactive image synthesis via BrainDiVE.

Abstract

A major goal of neuroscience is to understand brain computations during visual processing in naturalistic settings. A dominant approach is to use image-computable deep neural networks trained with different task objectives as a basis for linear encoding models. However, in addition to requiring estimation of a large number of linear encoding parameters, this approach ignores the structure of the feature maps both in the brain and the models. Recently proposed alternatives factor the linear mapping into separate sets of spatial and feature weights, thus finding static receptive fields for units, which is appropriate only for early visual areas. In this work, we employ the attention mechanism used in the transformer architecture to study how retinotopic visual features can be dynamically routed to category-selective areas in high-level visual processing. We show that this computational motif is significantly more powerful than alternative methods in predicting brain activity during natural scene viewing, across different feature basis models and modalities. We also show that this approach is inherently more interpretable as the attention-routing signals for different high-level categorical areas can be easily visualized for any input image. Given its high performance at predicting brain responses to novel images, the model deserves consideration as a candidate mechanistic model of how visual information from retinotopic maps is routed in the human brain based on the relevance of the input content to different category-selective regions.
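The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-head cross-attention step in plain Python, with random numbers standing in for backbone features and learned parameters. The names `cross_attention_route`, `linear_readout`, `w_key`, `w_value`, `w_out`, and all sizes are hypothetical. It shows that one ROI query produces a different spatial weighting for each image, whereas a factorized linear encoder applies the same spatial weights to every image.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_route(roi_queries, patch_tokens, w_key, w_value):
    """Pool patch tokens once per ROI query (single head, no layer norm).

    Returns (pooled, attn): pooled[r] is the value vector routed to ROI r,
    and attn[r][p] is the weight ROI r places on patch p for this image.
    """
    keys = [[dot(row, tok) for row in w_key] for tok in patch_tokens]
    values = [[dot(row, tok) for row in w_value] for tok in patch_tokens]
    scale = math.sqrt(len(roi_queries[0]))
    pooled, attn = [], []
    for query in roi_queries:
        weights = softmax([dot(query, key) / scale for key in keys])
        attn.append(weights)
        pooled.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return pooled, attn


def linear_readout(pooled_vec, w_out):
    """Map one ROI's pooled vector to its vertex responses."""
    return [dot(row, pooled_vec) for row in w_out]


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or backbone features."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
# Assumed toy sizes: a 4x4 patch grid, 3 ROIs, 5 vertices in the first ROI.
N_PATCHES, D_IN, D_MODEL, N_ROIS, N_VERTICES = 16, 8, 4, 3, 5
roi_queries = rand_matrix(N_ROIS, D_MODEL)
w_key = rand_matrix(D_MODEL, D_IN)
w_value = rand_matrix(D_MODEL, D_IN)
w_out = rand_matrix(N_VERTICES, D_MODEL)

image_a = rand_matrix(N_PATCHES, D_IN)  # stand-in for frozen backbone tokens
image_b = rand_matrix(N_PATCHES, D_IN)
pooled_a, attn_a = cross_attention_route(roi_queries, image_a, w_key, w_value)
pooled_b, attn_b = cross_attention_route(roi_queries, image_b, w_key, w_value)
responses = linear_readout(pooled_a[0], w_out)  # predicted vertices, ROI 0
```

Reshaping `attn_a[r]` to the 4x4 patch grid gives the kind of per-ROI attention map that can be overlaid on an image. In a trained model the queries and projections would be learned from fMRI data rather than drawn at random.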

Paper Structure

This paper contains 21 sections, 16 figures, 10 tables.

Figures (16)

  • Figure 1: A. Brain encoder architecture. The input patches are first encoded using a frozen backbone model. The features are then mapped using a transformer decoder to brain responses. B. The cross attention mechanism showing how learned queries for each ROI can route only the relevant tokens to predict the vertices in the corresponding ROI.
  • Figure 2: A. The general region of interest for highly visually responsive vertices in the back of the brain shown on different surface maps. B. Encoding accuracy (fraction of explained variance) shown for Subject 1 for all the vertices for the transformer model using ROIs for decoder queries. C. Encoding accuracy for individual ROIs and for ROI clusters based on category selectivity for the two hemispheres. D. The differences in encoding accuracy between the transformer and the ridge regression models showing that improvement in the former is driven by better prediction of higher visual areas.
  • Figure 3: A. The encoding accuracy for subject 1 shown on the brain surface for the transformer model with vertices as decoder queries. B. The difference in encoding accuracies going from ROIs to vertices as the decoder queries shows the improvement is almost entirely from the early visual areas. C. The vertex-based transformer model outperforms the ridge regression model for almost all the ROIs.
  • Figure 4: A. Encoding accuracy of the transformer encoding model with vertex-based queries ensembled across backbone layers. B. The backbone layer from which each vertex was best predicted. C. The improved performance of ensembling is almost entirely from better prediction of early visual areas.
  • Figure 5: Attention maps. Transformer decoder cross attention scores for three ROIs overlaid on the images (with brighter colors indicating higher attention weights). The selected ROIs show different ways in which the learned ROI queries can route information: based on location (V2d), content (FBA), or a combination of the two (OFA), depending on the location of the ROI in the brain processing hierarchy.
  • ...and 11 more figures
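
Several captions above report encoding accuracy as a fraction of explained variance and compare it between models. The sketch below shows one common way to compute such a score for a single vertex, the coefficient of determination. This excerpt does not give the paper's exact formula (for example, whether it uses squared correlation or noise-ceiling normalization), so both that choice and the function name `explained_variance` are assumptions.

```python
def explained_variance(measured, predicted):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot.

    measured and predicted are equal-length sequences of one vertex's
    responses across held-out images.
    """
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    mean = sum(measured) / len(measured)
    ss_tot = sum((y - mean) ** 2 for y in measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    if ss_tot == 0:
        raise ValueError("measured responses have zero variance")
    return 1.0 - ss_res / ss_tot


# Toy responses for one vertex across four images (made-up numbers).
measured = [1.0, 2.0, 3.0, 4.0]
perfect_score = explained_variance(measured, measured)
mean_only_score = explained_variance(measured, [2.5, 2.5, 2.5, 2.5])
```

Under this definition, a per-vertex difference map like the one described for Figure 2D would be the score of one model minus the score of the other at each vertex.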