Transformer brain encoders explain human high-level visual responses

Hossein Adeli; Sun Minni; Nikolaus Kriegeskorte

TL;DR

The paper tackles how retinotopic visual information is routed to high-level category-selective brain regions during natural viewing. It introduces a transformer-based brain encoder that uses cross-attention to dynamically gate information from retinotopic maps to ROI queries, outperforming traditional encoders across multiple backbones and modalities. Key contributions include demonstrating superior predictive accuracy, providing interpretable attention maps that reveal content-driven routing, and showing robustness to training data size and backbone choice. The work offers a mechanistic perspective on how visual information may be gated and expanded within the cortex, with implications for brain-inspired encoding and interactive image synthesis via BrainDiVE.

Abstract

A major goal of neuroscience is to understand brain computations during visual processing in naturalistic settings. A dominant approach is to use image-computable deep neural networks trained with different task objectives as a basis for linear encoding models. However, in addition to requiring estimation of a large number of linear encoding parameters, this approach ignores the structure of the feature maps both in the brain and the models. Recently proposed alternatives factor the linear mapping into separate sets of spatial and feature weights, thus finding static receptive fields for units, which is appropriate only for early visual areas. In this work, we employ the attention mechanism used in the transformer architecture to study how retinotopic visual features can be dynamically routed to category-selective areas in high-level visual processing. We show that this computational motif is significantly more powerful than alternative methods in predicting brain activity during natural scene viewing, across different feature basis models and modalities. We also show that this approach is inherently more interpretable as the attention-routing signals for different high-level categorical areas can be easily visualized for any input image. Given its high performance at predicting brain responses to novel images, the model deserves consideration as a candidate mechanistic model of how visual information from retinotopic maps is routed in the human brain based on the relevance of the input content to different category-selective regions.
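The routing idea described above can be illustrated with a minimal single-head cross-attention readout. This is a sketch under stated assumptions, not the authors' implementation: the function name `attention_readout`, the array shapes, the per-ROI output weights, and the random weights are all hypothetical and chosen only for illustration. Each ROI has a learned query; keys and values are computed from the image's patch features, so the spatial weighting changes with image content instead of being a fixed receptive field.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_readout(features, roi_queries, W_k, W_v, W_out):
    """Hypothetical single-head cross-attention readout.

    features:    (P, D) backbone features for P image patches
    roi_queries: (R, d) one learned query vector per ROI
    W_k, W_v:    (D, d) key and value projections
    W_out:       (R, d, V) per-ROI linear map to V voxel responses
    """
    keys = features @ W_k                                   # (P, d)
    values = features @ W_v                                 # (P, d)
    scores = roi_queries @ keys.T / np.sqrt(keys.shape[1])  # (R, P)
    attn = softmax(scores, axis=1)    # per-ROI weights over patches
    routed = attn @ values            # (R, d) content-gated summary per ROI
    responses = np.einsum("rd,rdv->rv", routed, W_out)      # (R, V)
    return responses, attn

# Toy dimensions and random weights (assumptions, for illustration only).
rng = np.random.default_rng(0)
P, D, d, R, V = 196, 64, 32, 4, 10
features = rng.standard_normal((P, D))
roi_queries = rng.standard_normal((R, d))
W_k = rng.standard_normal((D, d)) / np.sqrt(D)
W_v = rng.standard_normal((D, d)) / np.sqrt(D)
W_out = rng.standard_normal((R, d, V)) / np.sqrt(d)

responses, attn = attention_readout(features, roi_queries, W_k, W_v, W_out)
```

Because `attn` has one row per ROI and one column per patch, each row can be reshaped to the patch grid (14 x 14 in this toy setup) and displayed as a map over the image, which is the kind of visualization the abstract refers to. In a factorized spatial-feature encoder, the analogue of `attn` would be a fixed weight map that does not depend on `features`.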
