Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data

Antonis Antoniades, Yiyi Yu, Joseph Canzano, William Wang, Spencer LaVere Smith

TL;DR

Neuroscience data are increasingly multimodal and high-dimensional, demanding scalable analysis tools. Neuroformer reframes neural data analysis as a spatiotemporal autoregressive generation problem and uses a multimodal, multitask transformer pretrained with a contrastive objective to align neural activity, stimuli, and behavior. The approach yields accurate spike predictions, reveals directional connectivity via attention, and enables rapid few-shot behavioral decoding, demonstrated on simulated networks and real two-photon imaging data. This work lays groundwork for scalable, foundation-model–like analysis in systems neuroscience and suggests pathways to integrate neural data modeling with advances in large language models.

Abstract

State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these datasets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data as an autoregressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pretrained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets and found that it both accurately predicted simulated neuronal circuit activity and intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.

Paper Structure (47 sections, 5 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Neuroformer architecture. Inputs undergo contrastive matching to ensure efficient representations for downstream processing. Neural activity, stimuli, and other modalities are added to the Current State using cascading self- and cross-attention modules. Finally, the spike predictions (neuron ID and time interval tokens) are read out. At inference time, neuron IDs autoregressively populate the block, and the corresponding predicted time tokens act as additive temporal encodings.
  • Figure 2: Validating Neuroformer with ground truth data. (a) A model spiking neural network was generated to provide ground truth for validation. To provide a salient network feature, three "hub" neurons are strongly connected to many other neurons. (b) The ground truth connectivity matrix provides a reference for the network. (c) Raster plots (top) for neurons show that Neuroformer provides spike predictions (red) that closely match the spiking of the ground truth network (black). Correlations between the predicted and ground truth spike trains were generally positive, and >0.1 (bottom). (d) A simple Pearson correlation analysis reveals three subnetworks, but not the hub neurons, because of the lack of directionality or causality. (e) Attention mechanisms in Neuroformer infer causality and thus reveal the hub neurons.
  • Figure 3: Demonstrating utility of Neuroformer with real data. (a) We measured neural activity in two visual cortical areas, V1 and AL, in mice using large field-of-view two-photon calcium imaging. After screening for reliable neuronal responses to stimuli, 386 neurons were included for analysis. (b) Neuroformer modeled the spiking data accurately for responses to two classes of visual stimuli: gratings and naturalistic videos (black: ground truth; red: generated). (c) Neuroformer vs. GLM population response prediction comparison. Our model's predictions were more correlated with the ground truth (t-test, $p=0.0196$). (d) Attention provided data-driven maps of the parts of stimuli that had statistical dependencies with neuronal responses.
  • Figure 4: Raw Speed vs. Predicted for all methods for medial (top) and lateral (bottom).
  • Figure 5: (a) Velocity prediction vs. ground truth. Neuroformer's predictions were highly correlated (Pearson $r_{1\%, full}=0.97$) with the actual speed of the mouse. (b) Pretrained neuronal models exhibit few-shot learning of mouse speed. The pretrained model fine-tuned on $1\%$ of the speed data significantly outperformed a non-pretrained equivalent trained on $10\%$ of the data (Pearson $r_{1\%, ft}=0.51$ vs. $r_{10\%, no pre}=0.33$, respectively). (c) Behavior loss. The pretrained models (green) converged faster and to a lower testing loss than the non-pretrained equivalents.
  • ...and 10 more figures
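
The contrast drawn in Figure 2(d) and 2(e), where a Pearson correlation analysis finds subnetworks but misses the hub neurons because it carries no directionality, can be illustrated with a toy example. The sketch below is not the paper's simulation or Neuroformer's attention analysis. It assumes a hypothetical four-neuron network in which neuron 0 is a hub that drives three followers with a one-step delay, and it uses a lagged correlation as a simple stand-in for a directed measure. The function names `simulate_hub` and `lagged_corr`, the firing probabilities, and the lag are all illustrative assumptions.

```python
import numpy as np


def simulate_hub(n_steps=5000, n_followers=3, seed=0):
    """Toy binary spike trains (hypothetical setup, not the paper's network).

    Neuron 0 is a hub firing at random; each follower fires one step after
    a hub spike with probability 0.8, plus a little independent noise.
    """
    rng = np.random.default_rng(seed)
    spikes = np.zeros((n_followers + 1, n_steps), dtype=float)
    spikes[0] = rng.random(n_steps) < 0.2
    hub_prev = np.roll(spikes[0], 1)  # hub activity shifted to time t - 1
    hub_prev[0] = 0.0                 # discard the wrap-around sample
    for i in range(1, n_followers + 1):
        driven = hub_prev * (rng.random(n_steps) < 0.8)
        noise = rng.random(n_steps) < 0.02
        spikes[i] = np.maximum(driven, noise)
    return spikes


def lagged_corr(spikes, lag=1):
    """Entry [i, j]: Pearson correlation of neuron i at t with neuron j at t + lag."""
    n = spikes.shape[0]
    full = np.corrcoef(spikes[:, :-lag], spikes[:, lag:])
    return full[:n, n:]


spikes = simulate_hub()
zero_lag = np.corrcoef(spikes)   # symmetric by construction
lagged = lagged_corr(spikes)     # asymmetric: row = source, column = target

print("zero-lag hub vs follower:", round(zero_lag[0, 1], 3))
print("zero-lag follower vs follower:", round(zero_lag[1, 2], 3))
print("lagged hub -> follower:", round(lagged[0, 1], 3))
print("lagged follower -> hub:", round(lagged[1, 0], 3))
```

In this toy setting the zero-lag matrix groups the followers together but shows almost no relationship between the hub and its followers, and because it is symmetric it cannot say who drives whom. The lagged matrix is large from hub to follower and near zero in the reverse direction, which is the kind of asymmetry a directed measure needs to expose a hub.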