BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

Andrew F. Luo; Margaret M. Henderson; Michael J. Tarr; Leila Wehbe

TL;DR

BrainSCUBA presents a data-driven framework to generate voxel-wise natural language captions describing semantic selectivity in the human visual cortex. By coupling a frozen CLIP image encoder with a linear voxel projection and a softmax-based mapping into natural-image CLIP space, the method enables per-voxel captions generated by a captioning model, without voxel-caption supervision. The approach yields captions that align with known category-selective regions, enables text-conditioned diffusion-based image synthesis, and reveals fine-grained semantic structure in regions such as EBA, FFA, RSC, OPA, PPA, and even TPJ/PCV in the context of social content. This voxel-level, language-grounded framework offers a scalable path for data-driven discoveries about functional specialization in higher visual areas and supports hypothesis-driven investigations into brain semantics.
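The softmax-based mapping mentioned above can be illustrated with a minimal sketch. It assumes the projection is a softmax-weighted average of unit-norm natural-image embeddings, scored by their similarity to a voxel's encoder weight. The function names, the temperature value, and the toy embedding bank are hypothetical and chosen for illustration only. They are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_image_space(voxel_weight, image_embeddings, temperature=0.01):
    """Map one voxel's encoder weight into the span of image embeddings.

    temperature is an assumed hyperparameter: small values concentrate the
    mixture on the most similar images, large values approach a plain mean.
    """
    w = l2_normalize(voxel_weight)
    bank = [l2_normalize(e) for e in image_embeddings]
    # Cosine similarity between the voxel weight and every image embedding.
    scores = [sum(a * b for a, b in zip(w, e)) / temperature for e in bank]
    probs = softmax(scores)
    # Softmax-weighted average of the image embeddings, renormalized.
    mixed = [sum(p * e[i] for p, e in zip(probs, bank)) for i in range(len(w))]
    return l2_normalize(mixed)
```

Calling `project_to_image_space([2.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])` returns a unit vector that lies almost entirely along the first embedding, since that is the image direction most similar to the weight. In a real setting the bank would hold CLIP embeddings of many natural images, and the computation would be vectorized.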

Abstract

Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.

Paper Structure (31 sections, 4 equations, 27 figures, 7 tables)

Introduction
Related Work
Semantic Selectivity in Higher Visual Cortex.
Image-Captioning with CLIP and Language Models.
Brain-Conditioned Image and Caption Generation.
Methods
Image-to-Brain Encoder Construction
Deriving the Optimal Embedding and Closing the Gap
Results
Setup
Voxel-Wise Text Generations
Text-Guided Brain Image Synthesis
Investigating the Brain's Social Network
Discussion
Limitations and Future Work.
...and 16 more sections

Figures (27)

Figure 1: Architecture of BrainSCUBA. (a) Our framework relies on an fMRI encoder trained to map from images to voxel-wise brain activations. The encoder consists of a frozen CLIP image network with a unit-norm output and a linear probe. (b) We decode the voxel-wise weights by projecting them into the space of CLIP embeddings for natural images, followed by sentence generation. (c) Selected sentences from each region; see the experiments for a full analysis.
Figure 2: Projection of fMRI encoder weights. (a) We validate the encoder $R^2$ on a test set, and find it can achieve high accuracy in the higher visual cortex. (b) The joint UMAP of image CLIP embeddings and the pre-/post-projection encoder weights. All embeddings are normalized before UMAP. (c) We measure the average cosine similarity between pre-/post-projection weights, and find that it increases with the number of images used. Standard deviation of 5 projections shown in light blue.
Figure 3: Interpreting the nouns generated by BrainSCUBA. We take the projected encoder weights and fit a UMAP transform that maps to $4$ dimensions. (a) The 50 most common noun embeddings across the brain are projected and transformed using the fMRI UMAP. (b) Flatmap of S1 with ROIs labeled. (c) Inflated view of S1. (d) Flatmaps of S2, S5, S7. We find that BrainSCUBA nouns are aligned to previously identified functional regions. Shown here are body regions (EBA), face regions (FFA-1/FFA-2/aTL-faces), and place regions (RSC/OPA/PPA). Note that the yellow near FFA matches the food regions identified by Jain et al. (2023). The visualization style is inspired by Huth et al. (2016).
Figure 4: Top BrainSCUBA nouns via voxel-wise captioning in broad category-selective regions. We perform part-of-speech tagging and lemmatization to extract the nouns; the $y$-axis is normalized by voxel count. We find that the generated captions are semantically related to the functional selectivity of broad category-selective regions. Note that the word "close" tended to appear in the noun phrase "close-up", which explains its high frequency in the captions from food- and word-selective voxels.
Figure 5: Novel images for category selective voxels in S2. We visualize the top-5 images from the fMRI stimuli and generated images for the place/word/face/body regions, and the top-10 images for the food region. We observe that images generated with BrainSCUBA appear more coherent.
...and 22 more figures
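
The encoder described in the Figure 1 caption (a unit-norm image embedding followed by a linear probe) and the $R^2$ check mentioned in Figure 2(a) can be sketched as follows. This is an illustrative toy under stated assumptions: the embedding is passed in as a plain list rather than produced by CLIP, and the names `predict_voxels` and `r_squared`, the per-voxel bias term, and the choice of the standard coefficient of determination are hypothetical rather than the authors' implementation.

```python
import math


def unit_norm(vec):
    """Scale an embedding to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def predict_voxels(embedding, weights, biases):
    """Linear probe: one weight vector and one bias per voxel (assumed form)."""
    z = unit_norm(embedding)
    return [
        sum(w_i * z_i for w_i, z_i in zip(w, z)) + b
        for w, b in zip(weights, biases)
    ]


def r_squared(y_true, y_pred):
    """Coefficient of determination for one voxel across test images."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because the embedding is normalized before the probe, each voxel's prediction depends only on the direction of the embedding, which is what makes the probe weights comparable to embedding directions.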
