Animal behavioral analysis and neural encoding with transformer-based self-supervised pretraining

Yanchen Wang, Han Yu, Ari Blau, Yizi Zhang, The International Brain Laboratory, Liam Paninski, Cole Hurwitz, Matt Whiteway

TL;DR

This work introduces BEAST, a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses and establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.

Abstract

The brain can only be fully understood through the lens of the behavior it generates -- a guiding principle in modern neuroscience research that nevertheless presents significant technical challenges. Many studies capture behavior with cameras, but video analysis approaches typically rely on specialized models requiring extensive labeled data. We address this limitation with BEAST (BEhavioral Analysis via Self-supervised pretraining of Transformers), a novel and scalable framework that pretrains experiment-specific vision transformers for diverse neuro-behavior analyses. BEAST combines masked autoencoding with temporal contrastive learning to effectively leverage unlabeled video data. Through comprehensive evaluation across multiple species, we demonstrate improved performance in three critical neuro-behavioral tasks: extracting behavioral features that correlate with neural activity, and pose estimation and action segmentation in both the single- and multi-animal settings. Our method establishes a powerful and versatile backbone model that accelerates behavioral analysis in scenarios where labeled data remains scarce.

Paper Structure

This paper contains 89 sections, 12 figures, 15 tables.

Figures (12)

  • Figure 1: Beast framework. A: Our self-supervised pretraining framework Beast combines masked autoencoding (He et al., 2022) with temporal contrastive learning (Chen et al., 2020). An anchor frame at time $t$ is paired with a positive frame from $t\pm1$, while more distant frames from the same video, or frames from other videos, serve as negative examples. Frames are divided into patches, with most patches randomly masked. A vision transformer (ViT) processes the remaining patches, from which all patches must be reconstructed. The ViT CLS tokens, which serve as a global representation of each frame, are nonlinearly projected to a new space where the contrastive loss pulls anchor-positive pairs together and pushes anchor-negative pairs apart. B: Beast supports various downstream neuro-behavioral tasks including neural encoding, pose estimation, and action segmentation.
  • Figure 2: Beast improves neural encoding. A: Example video frame from each dataset. B: Encoding performance is evaluated across multiple baseline features with both linear models (hatched bars; reduced rank regression, RRR) and nonlinear models (solid bars; temporal convolution network, TCN). CEBRA uses a contrastive loss to embed video frames in a latent feature space. "Motion energy" for the IBL-whisker dataset is a 1D estimate of movement calculated as the sum of the absolute pixel differences between successive frames. Beast features outperform all baselines in both linear and nonlinear regimes. Error bars show standard error of the mean (S.E.M.) of bits per spike (BPS) across $N=842$ neurons from five test sessions (IBL and IBL-whisker) or S.E.M. of variance explained of the principal components of the neural activity across five test sessions (Facemap; see text). C: Scatterplot comparison of Beast vs. keypoint-based model performance in an example session. Each dot corresponds to an individual neuron. The values in the bottom-right corner represent the session-averaged BPS. D: Top, middle: comparison of the predicted trial-averaged firing rates for Beast and keypoints (lines) and single-trial variability obtained by subtracting the neuron's average firing rate on each trial (heatmaps). Bottom: comparison of predicted neural principal components for the Facemap dataset.
  • Figure 3: Beast improves pose estimation. A: Example frame from each dataset overlaid with ground truth annotations. Green stars indicate the highlighted keypoint in panel B. B: Example traces from the ResNet-50 (gray) and Beast (green) models for a single keypoint in a held-out video. Beast traces evolve more smoothly in time and do not contain erroneous jumps like the ResNet-50 baseline. C: Pixel error as a function of keypoint difficulty (see main text; smaller is better): the left-hand side shows performance across all keypoints; moving to the right drops the easier keypoints, as defined by inter-seed and inter-model prediction variance. Vertical dashed lines indicate the percentage of data used for the pixel error computation. ViT-M (IN) is a ViT backbone pretrained on ImageNet with a masked autoencoding loss; ViT-M (IN+PT) uses the same architecture and loss but is initialized with ImageNet-pretrained weights and then further pretrained on experiment-specific unlabeled frames. ViT-C (IN+PT) performs the experiment-specific pretraining using the temporal contrastive loss only.
  • Figure 4: Beast improves action segmentation. A: Example frame from each dataset; performance evaluated across multiple baseline features with both TCN (solid) and ensembled (hatched) models. Error bars represent standard error of the mean across five random initializations. B: Confusion matrices for TCN models based on keypoints and Beast patch embeddings. C: Example behavior sequences with feature traces (single seed shown for Beast models), ensemble probabilities, ensembled model ethograms, ground truth ethograms, and error frames. PCs of SimBA and Beast features are shown for illustration, but the models utilize the full feature set.
  • Figure 5: Effect of Batch Normalization on contrastive training accuracy. Training contrastive accuracy improves significantly with the use of Batch Normalization (BatchNorm) in the nonlinear projection head. Models trained with BatchNorm exhibit smoother learning curves and achieve higher final accuracy compared to those without BatchNorm. "Accuracy" is defined as the fraction of anchor frames in a batch where the corresponding positive frame has a logit score higher than that of all other negative frames.
  • ...and 7 more figures
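
Two quantities defined in the captions above are simple enough to write down directly: the "motion energy" of Figure 2 (sum of absolute pixel differences between successive frames) and the contrastive "accuracy" of Figure 5 (fraction of anchors whose positive frame scores higher than every negative). The NumPy sketch below is illustrative only and is not the authors' code. The function names, the array shapes, the use of cosine similarity as the logit, and the choice to treat the other frames' positives in the batch as the negatives are all assumptions made for this example.

```python
import numpy as np


def motion_energy(frames: np.ndarray) -> np.ndarray:
    """1D movement estimate for a grayscale video of shape (T, H, W).

    Returns an array of length T - 1 whose entry i is the sum of absolute
    pixel differences between frame i and frame i + 1.
    """
    # Cast first so that uint8 video does not wrap around on subtraction.
    frames = frames.astype(np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2))


def contrastive_accuracy(anchors: np.ndarray, positives: np.ndarray) -> float:
    """Fraction of anchors whose positive out-scores every negative.

    anchors, positives: (B, D) projected embeddings, where row i of
    `positives` is the positive frame for row i of `anchors`.
    Assumption: rows j != i of `positives` act as negatives for anchor i,
    and the logit is cosine similarity.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T                      # (B, B) similarity matrix
    pos_logit = np.diag(logits)           # anchor i vs. its own positive
    neg_logits = logits.copy()
    np.fill_diagonal(neg_logits, -np.inf) # mask out the positive entry
    return float(np.mean(pos_logit > neg_logits.max(axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.integers(0, 256, size=(5, 8, 8), dtype=np.uint8)
    print("motion energy:", motion_energy(video))

    z = rng.normal(size=(4, 16))
    noisy = z + 0.01 * rng.normal(size=z.shape)
    print("contrastive accuracy:", contrastive_accuracy(z, noisy))
```

A strict inequality is used for the accuracy, so an anchor whose positive merely ties with a negative counts as a miss; the source does not specify how ties are handled.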