Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo

TL;DR

The paper tackles the challenge of building generalizable, image-computable encoders for human higher visual cortex under substantial inter-subject variability and data constraints. It proposes BraInCoRL, a transformer-based meta-learning framework that performs in-context learning to infer voxelwise encoding functions from a few stimuli without any finetuning on new subjects, leveraging cross-subject data and context from multiple images. The approach demonstrates strong data efficiency and cross-dataset generalization (NSD and BOLD5000), reveals interpretable attention to category-relevant stimuli, and enables language-driven, zero-shot mappings to voxel selectivity. Overall, BraInCoRL provides a foundation model for fMRI encoders that supports rapid, subject-specific cortical mapping and has potential applications in clinical mapping and brain–computer interfaces.

Abstract

Understanding functional representations within higher visual cortex is a fundamental question in computational neuroscience. While artificial neural networks pretrained on large-scale datasets exhibit striking representational alignment with human neural responses, learning image-computable models of visual cortex relies on individual-level, large-scale fMRI datasets. The necessity for expensive, time-intensive, and often impractical data acquisition limits the generalizability of encoders to new subjects and stimuli. BraInCoRL uses in-context learning to predict voxelwise neural responses from few-shot examples without any additional finetuning for novel subjects and stimuli. We leverage a transformer architecture that can flexibly condition on a variable number of in-context image stimuli, learning an inductive bias over multiple subjects. During training, we explicitly optimize the model for in-context learning. By jointly conditioning on image features and voxel activations, our model learns to directly generate better performing voxelwise models of higher visual cortex. We demonstrate that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime when evaluated on entirely novel images, while also exhibiting strong test-time scaling behavior. The model also generalizes to an entirely new visual fMRI dataset, which uses different subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli. Finally, we show that our framework enables interpretable mappings from natural language queries to voxel selectivity.
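
As a rough illustration of the data flow the abstract describes, the sketch below conditions on a variable-size set of (image embedding, voxel activation) pairs and emits the weights of a linear voxelwise encoder, which is then applied to novel images. It is a shape-level toy under stated assumptions, not the authors' architecture: a single untrained attention-pooling step with random projections stands in for the transformer, and every name and dimension (`generate_hyperweights`, `W_key`, `query`, `W_out`, `d`, `h`) is hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def generate_hyperweights(ctx_embeds, ctx_acts, params):
    """Map a context set of (embedding, activation) pairs to encoder weights.

    ctx_embeds: (n_ctx, d) image embeddings from a frozen feature extractor.
    ctx_acts:   (n_ctx,)   measured activations of a single voxel.
    Returns a (d,) weight vector for a linear voxelwise encoder.
    """
    # Each context token jointly carries the image features and the response.
    tokens = np.concatenate([ctx_embeds, ctx_acts[:, None]], axis=1)  # (n_ctx, d+1)
    # Attention pooling: one score per token (projections are random here,
    # whereas a trained model would learn them).
    scores = tokens @ params["W_key"] @ params["query"]               # (n_ctx,)
    attn = softmax(scores / np.sqrt(params["query"].size))
    pooled = attn @ tokens                                            # (d+1,)
    # Project the pooled summary to the encoder weights ("hyperweights").
    return pooled @ params["W_out"]                                   # (d,)

def predict(query_embeds, weights):
    """Linear voxelwise encoder applied to embeddings of novel images."""
    return query_embeds @ weights

rng = np.random.default_rng(0)
d, h = 16, 8                     # hypothetical embedding and hidden sizes
params = {
    "W_key": rng.normal(size=(d + 1, h)),
    "query": rng.normal(size=h),
    "W_out": rng.normal(size=(d + 1, d)),
}

# The same function accepts context sets of different sizes.
for n_ctx in (5, 50):
    weights = generate_hyperweights(
        rng.normal(size=(n_ctx, d)), rng.normal(size=n_ctx), params
    )
    preds = predict(rng.normal(size=(3, d)), weights)
    print(n_ctx, weights.shape, preds.shape)
```

Because the pooling step sums over tokens, the generated weights do not depend on the order of the context examples, and no per-subject gradient step is involved at test time. A trained model would learn the projections across many voxels and subjects rather than sampling them at random.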

Paper Structure

This paper contains 42 sections, 12 equations, 38 figures, 18 tables.

Figures (38)

  • Figure 1: BraInCoRL: Meta-Learning an In-Context Visual Cortex Encoder. (a) The voxelwise brain encoding problem. For each voxel, there is a response function that maps from visual stimuli to voxel activation. In practice, we can only observe noisy measurements from fMRI. The goal is to infer an image-computable function for each voxel that predicts its activation. (b) BraInCoRL treats each voxel as a meta-learning task and samples (image, response) pairs from multiple subjects. During testing, the model is conditioned on a small number of novel images and measurements from a new subject and directly outputs the function parameters. (c) From left to right: the explained variance from the full model trained on 9,000 images from one subject, BraInCoRL with only $100$ in-context images from the new subject, and a baseline ridge regression also with $100$ images (for this baseline, voxelwise regularization is determined using 5-way cross-validation). Our method achieves much higher data efficiency than the baseline. (d) Explained variance as a function of in-context support set size. As the in-context support set size increases from 0 to 1,000, BraInCoRL steadily improves and approaches the fully trained reference model, which is fit to convergence on each subject’s full 9,000-image training set, demonstrating high prediction accuracy and data efficiency.
  • Figure 2: Architecture of the In-Context Voxelwise Encoder (BraInCoRL). (1) A pretrained feature extractor converts visual stimuli into vector embeddings. (2) A higher visual cortex transformer integrates these embeddings with voxel activations to learn context-specific features and generates hyperweights for a subsequent voxelwise encoder backbone. (3) The voxelwise encoder, conditioned on the hyperweights, predicts voxel responses for novel stimuli.
  • Figure 3: Evaluation on NSD. (a) Prediction explained variance of BraInCoRL improves on novel subjects with larger in-context support set size, outperforming within-subject ridge regression and approaching the fully trained reference model fit on each subject’s full 9,000-image training set, using far less data. (b) Ablation (100 support images) comparing BraInCoRL variants: the original model trained while holding out the novel subject's 9,000 test-time support images ("HO"), a BraInCoRL model trained without this holdout ("no HO"), and a pretraining-only BraInCoRL model, alongside the within-subject ridge baseline. Results show that finetuning with real fMRI data improves performance, and holding out the test subject’s image data does not hinder generalization. (c) Voxelwise explained variance from BraInCoRL (100 images) is strongly correlated with fully trained reference models across different visual encoder backbones. Note that the y-axis represents the explained variance of the fully trained model (9,000 images), while the x-axis represents the explained variance of BraInCoRL.
  • Figure 4: UMAP visualization of predicted response weights. We apply UMAP to BraInCoRL-predicted voxelwise weights (100 support images) and show: (a) a flatmap for S1 with ROI outlines, (b) the same projection on an inflated surface, and (c) flatmaps for S2, S5, and S7. Color-coded clusters align with body/face regions (EBA, FFA/aTL-faces), place regions (RSC, OPA, PPA), and food regions (in red).
  • Figure 5: Evaluation on BOLD5000. We evaluate BraInCoRL on the BOLD5000 dataset, which was collected using a different scanner than NSD. For varying in-context support set sizes, we report voxelwise Pearson correlation between predicted and true responses for both BraInCoRL and within-subject ridge regression. BraInCoRL achieves higher accuracy and greater data efficiency.
  • ...and 33 more figures
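
The within-subject ridge baseline and the two metrics named in the captions above (explained variance and voxelwise Pearson correlation) can be sketched as follows. This is a minimal NumPy version on synthetic data, assuming a linear encoder on precomputed image embeddings, one regularization strength per voxel chosen by 5-fold cross-validation, and the standard definition 1 - Var(residual) / Var(target) for explained variance. The function names, the alpha grid, and the data sizes are hypothetical, and the paper's exact protocol may differ.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights for all voxels: (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def explained_variance(Y_true, Y_pred):
    """Per-voxel 1 - Var(residual) / Var(target)."""
    return 1.0 - (Y_true - Y_pred).var(axis=0) / Y_true.var(axis=0)

def pearson_r(Y_true, Y_pred):
    """Per-voxel Pearson correlation between measured and predicted responses."""
    a = Y_true - Y_true.mean(axis=0)
    b = Y_pred - Y_pred.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

def ridge_cv_per_voxel(X, Y, alphas, n_folds=5, seed=0):
    """Pick one alpha per voxel by k-fold CV, then refit on all the data."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    sq_err = np.zeros((len(alphas), Y.shape[1]))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for i, alpha in enumerate(alphas):
            W = ridge_fit(X[train], Y[train], alpha)
            sq_err[i] += ((Y[held_out] - X[held_out] @ W) ** 2).sum(axis=0)
    best = sq_err.argmin(axis=0)          # index of the best alpha for each voxel
    W = np.empty((X.shape[1], Y.shape[1]))
    for i, alpha in enumerate(alphas):
        cols = best == i
        if cols.any():
            W[:, cols] = ridge_fit(X, Y[:, cols], alpha)
    return W, np.asarray(alphas)[best]

# Synthetic stand-in for (image embedding, voxel response) data.
rng = np.random.default_rng(0)
n_train, n_test, d, n_vox = 100, 200, 32, 10
W_true = rng.normal(size=(d, n_vox))
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
Y_train = X_train @ W_true + rng.normal(size=(n_train, n_vox))
Y_test = X_test @ W_true + rng.normal(size=(n_test, n_vox))

W_hat, chosen = ridge_cv_per_voxel(X_train, Y_train, alphas=[0.1, 1.0, 10.0, 100.0])
ev = explained_variance(Y_test, X_test @ W_hat)
r = pearson_r(Y_test, X_test @ W_hat)
print("mean explained variance:", ev.mean())
print("mean Pearson r:", r.mean())
```

Both metrics are computed per voxel on held-out images, so they can be averaged over voxels or drawn on a cortical surface. Selecting the regularization strength per voxel lets noisier voxels receive stronger shrinkage than well-measured ones.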