Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

Rining Wu; Feixiang Zhou; Ziwei Yin; Jian K. Liu

TL;DR

This study introduces Vi-ST, a spatiotemporal model that aligns dynamic natural scenes with retinal ganglion cell (RGC) coding by combining a self-supervised Vision Transformer (ViT) prior with a causal spatiotemporal CNN and conditioning on receptive-field (RF) information. The model generalizes across videos when predicting RGC spike trains, and the results show the value of a temporally aware loss (the Vi-ST loss) and of a population-coding perspective. Ablation analyses identify the Spikes Alignment module and the ViT prior as the key drivers of performance, while experiments on complementary coding indicate that larger encoding spaces benefit neural prediction. The work provides a framework for temporally coherent brain–video mappings and suggests applicability to neural encoding beyond the retina.

Abstract

Our brains represent the ever-changing environment with neurons in a highly dynamic fashion. The temporal features of visual pixels in dynamic natural scenes are embedded in the neuronal responses of the retina, so it is crucial to establish the intrinsic temporal relationship between visual pixels and neuronal responses. Recent foundation vision models have opened an advanced route to understanding image pixels, yet how neuronal coding in the brain aligns with pixels is still poorly understood. Most previous studies use static images, or artificial videos derived from static images, to emulate more realistic and complex stimuli. Although these simple scenarios help to isolate key factors influencing visual coding, they leave complex temporal relationships unconsidered. To decompose the temporal features of visual coding in natural scenes, we propose Vi-ST, a spatiotemporal convolutional neural network fed with a self-supervised Vision Transformer (ViT) prior, aimed at unraveling the temporal encoding patterns of retinal neuronal populations. The model demonstrates robust predictive performance in generalization tests. Through detailed ablation experiments, we demonstrate the significance of each temporal module. We also introduce a visual coding evaluation metric designed to integrate temporal considerations, and we compare the impact of different numbers of neuronal populations on complementary coding. In conclusion, Vi-ST provides a novel modeling framework for neuronal coding of dynamic visual scenes in the brain, effectively aligning the brain's representation of video with neuronal activity. The code is available at https://github.com/wurining/Vi-ST.

Paper Structure

This paper contains 26 sections, 4 equations, 6 figures, and 2 tables.

Introduction
Related Work
Retinal Recurrent Connection Mechanism
Image and Video Models
Self-supervised Foundational Vision Model
3D CNN-based temporal modeling
Methodology
Video Features Extractor
Spikes Alignment
Loss Function
Experiments
Dataset
RGCs Responses
Natural Scene Stimuli
Metrics
...and 11 more sections

Figures (6)

Figure 1: The architecture of our proposed Vi-ST model. The model comprises two primary components: a Video Feature Extractor based on the ViT prior, and a Spikes Alignment module. The original video is treated as a stack of frames, and spatial features are computed independently for each frame using a pre-trained ViT. These features are packaged into a spatiotemporal cube of length $t$, which is then processed by a Causal 3D Temporal Convolutional Network (C3TCN) to model spatiotemporal dynamics. The receptive field (RF) information of the RGCs is used as a conditioning factor: it is fused with the spatiotemporal features through the 3D AdaLN Zero module, and the result is fed into a series of Causal Multiscale Spatiotemporal (CMST) blocks. These CMST blocks model the neuronal activity of a specific number of RGC neurons, and a final linear mapping yields the spike predictions.
Figure 2: Vi-ST Submodules. (a) The detailed architecture of the video features extractor. (b) The Causal Multiscale Spatiotemporal module (CMST) in the spikes alignment module.
Figure 3: RGC durations. The grey line represents the high-frequency durations of RGCs, in which we manually identified several peaks. To include a complete duration in full, we choose a truncation position slightly after the peak.
Figure 4: RGC Prediction Results. We visualized two example cells showing representative prediction results. By comparing the response results of different models, we can intuitively understand the relationship between the metric and the actual response. More examples can be found in the Supplementary Materials.
Figure 5: Loss Function Comparison and DINOv2 Prior Comparison, with CC (left y-axis) and SD-KL (right y-axis, represented by red triangles). (a) Comparison of the Vi-ST loss and the RMSE loss, where Mov1 denotes training on Mov1 and testing on Mov2, and vice versa. (b) Comparison of different DINOv2 priors, where the DINOv2 layer number is written as $L-n$, e.g. $L-1$ denotes the first layer.
...and 1 more figures
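
The caption of Figure 1 describes the temporal convolutions as causal. As a rough illustration of that property only, the sketch below applies a one-dimensional causal convolution along the time axis in plain Python. It is not the authors' C3TCN: the function name `causal_conv1d`, the kernel values, and the reduction to a single feature channel are assumptions made for this example. The key point is that padding is added only on the past side, so the output at frame `t` never depends on frames after `t`.

```python
def causal_conv1d(signal, kernel):
    """Causal temporal convolution over a list of per-frame values.

    Hypothetical helper for illustration; kernel[-1] weights the current
    frame and kernel[0] weights the oldest frame in the window.
    """
    k = len(kernel)
    # Left-pad with zeros only, so no future frame can enter any window.
    padded = [0.0] * (k - 1) + list(signal)
    out = []
    for t in range(len(signal)):
        window = padded[t:t + k]  # covers signal[t-k+1 .. t]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


# Toy per-frame feature values (assumed data, not from the paper).
frames = [1.0, 2.0, 3.0, 4.0]
averaging_kernel = [0.5, 0.5]

response = causal_conv1d(frames, averaging_kernel)

# Changing a future frame leaves all earlier outputs untouched.
perturbed = causal_conv1d([1.0, 2.0, 3.0, 100.0], averaging_kernel)
earlier_outputs_unchanged = response[:3] == perturbed[:3]
```

The same left-only padding idea extends to the time axis of a 3D convolution, where the spatial axes can still be padded symmetrically.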
