Representational drift changes the encoding of fast and slow-varying natural scene features differently

Siwei Wang; Elizabeth A de Laittre; Jason MacLean; Stephanie E Palmer

TL;DR

This work investigates the differences in representational drift across spatiotemporal features in a moving visual stimulus and learns a latent space embedding using weakly supervised contrastive learning that is near-optimal for decoding natural features and neural activity from novel animals.

Abstract

Representational drift refers to an unstable mapping between neural activity and input sensory or output behavioral variables. While much work has focused on the effect of representational drift on single, simple external variables, we investigate the differences in representational drift across spatiotemporal features in a moving visual stimulus. The neural responses across animals to the same movie reflect both common, encoded stimulus features and idiosyncratic individual variation. To extract only the shared neural encoding of stimulus features, we learn a latent space embedding using weakly supervised contrastive learning. This approach pulls neural responses together in the embedding space if they are responses to the same stimulus segment and pushes them apart otherwise. It enables us to probe how stimulus features that fluctuate on timescales as short as 33 ms (the duration of one movie frame) are encoded by variable neural codes across animals. It also allows us to investigate how representational drift changes the encoding in individuals across sessions. We observe that our learned embedding is near-optimal for decoding natural features (background scenery, local motion, complex spatio-temporal features, and time) and neural activity from novel animals. This suggests that our embedding retains the encoding of multiple features at higher temporal granularity than previous methods. To quantify representational drift, we apply the trained decoder (which achieves near-optimal performance in one session) to a subsequent session recorded 90 minutes later. We then use the decrease in decoding performance as a proxy for the magnitude of drift. We show that drift changes the encoding of fast-varying local motion features at a rate 5-6 times higher than that of slower-varying scenery features. Drift also perturbs the local geometry in the embedding.
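The pull-together, push-apart objective described in the abstract can be illustrated with a minimal sketch. The exact loss used in the paper is not given in this summary, so the code below swaps in a generic supervised-contrastive (InfoNCE-style) loss; the function name `contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(embeddings, segment_ids, temperature=0.1):
    """Generic supervised-contrastive loss (an assumed stand-in, not the paper's loss).

    Samples sharing a segment id are positives for each other;
    every other sample in the batch acts as a negative.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if segment_ids[j] == segment_ids[i]]
        if not positives:
            continue  # this anchor has no positive partner in the batch
        logits = {j: dot(z[i], z[j]) / temperature for j in others}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        counted += 1
    return total / counted

# Toy batch: responses from two hypothetical animals to two movie segments.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = contrastive_loss(embeddings, [0, 0, 1, 1])     # labels agree with geometry
loss_mismatched = contrastive_loss(embeddings, [0, 1, 0, 1])  # labels contradict geometry
print(loss_matched < loss_mismatched)
```

The loss is low when responses to the same segment already sit close together and high when they do not, which is the gradient signal that pulls same-segment responses together during training.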

Paper Structure (10 sections, 3 theorems, 8 equations, 16 figures, 4 tables)

The visual stimulus contains features with multiple spatiotemporal scales
Learning a generalizable representation from weakly supervised, cross-modality contrastive training
Quantifying the magnitude of representational drift via changes in decoding performance
Representational drift disrupts the decoding of local motion features through perturbations to the embedding geometry
Obtaining scenery/local motion features of the movie based on hierarchical clustering
Settings of the weakly supervised contrastive learning
Dataset and Training details
Calculation of temporal distance between the predicted and true feature
Theoretical framework for interpreting near-optimal decoding and characterizing embedding geometry
Supplementary Results

Key Result

Lemma 1

Let Z-X-Y be a Markov chain of random variables. For any loss function $l$,

Figures (16)

Figure 1: Local motion and scenery features fluctuate at different timescales. A) When we cluster static scenery frames and local motion frames separately, we find that frames with similar scenery content (sharing the same clustering label) often contain different local motion content (as determined by clustering procedures applied independently to scenery and local motion frames; see main text and Supplementary Materials \ref{supp:clustering}). The three example static frames (top) all share the same scenery label, yet their corresponding local motion frames at time $t$ (bottom; each is calculated as the difference between frames $t$ and $t+1$) each have a different local motion label. B) Autocorrelation decays differently for scenery and local motion labels (schematized at the top, with actual data below). The autocorrelation of local motion features decays rapidly, reaching near zero after just 33-100 ms (1-3 frames). In contrast, the autocorrelation of scenery features decays much more slowly, reaching zero only after 400-500 ms (12-15 frames) and showing a negative peak around 1 s (likely corresponding to scene changes in the movie). This analysis uses the first 400 frames (first half of the movie); similar decay timescales are observed in the second half of the movie (see Supplementary Materials \ref{autocorr2}).
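The two operations named in the Figure 1 caption, frame differencing and autocorrelation over frame lags, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: frames are flat lists of pixel values, the sign convention of the difference is assumed, and the `slow` and `fast` traces are synthetic stand-ins for scenery and local motion signals, not the movie data.

```python
import random

FRAME_MS = 33  # duration of one movie frame, as stated in the caption

def local_motion(frames):
    """Pixelwise difference between consecutive frames.

    Each frame is a flat list of pixel values. The sign convention
    (frame t+1 minus frame t) is an assumption.
    """
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]

def autocorrelation(signal, lag):
    """Normalized autocorrelation of a 1-D signal at a lag given in frames."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    variance = sum(c * c for c in centered)
    return sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance

# Synthetic traces, 400 frames each (not the movie data).
rng = random.Random(0)
slow = [1.0 if (t // 12) % 2 == 0 else -1.0 for t in range(400)]  # flips every 12 frames
fast = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(400)]  # independent per frame

for lag in (1, 3, 12):
    print(f"{lag * FRAME_MS:4d} ms  slow={autocorrelation(slow, lag):+.2f}  "
          f"fast={autocorrelation(fast, lag):+.2f}")
```

A trace that changes independently every frame loses its correlation within one lag, while a trace that holds its value for about 12 frames stays correlated at short lags and turns negative near the flip period, the same qualitative contrast the caption describes.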
Figure 2: Weakly supervised contrastive learning extracts features from the movie and neural code. Our method uses temporal co-occurrence to bring together neural and visual representations from the same temporal window while separating those from different windows. The approach consists of two phases. The first is single-modality training, in which separate ResNet50 networks independently learn embeddings for neural activity and the visual stimulus. The second is cross-modality training, which aligns the output embeddings of these ResNet50 networks. The goal is to refine the learned embeddings so that samples from any modality that correspond to the same time bin are pulled close together and samples from different time bins are pushed apart (each cluster of blue or green dots represents samples of different modalities from the same time, which are pulled together). As a result, the method produces a new, modified embedding with shared decision boundaries for stimulus features (e.g., times $t_1$ and $t_2$ shown in green and blue) while matching their supports across modalities (lower right panel). This support match ensures that the stimulus features in the embedding learned from neural activity precisely correspond to those in the visual stimulus and vice versa. The alignment enables assessment of the encoding of complex natural features without explicit parameterization. See Supplementary Materials \ref{alltoall}.
Figure 3: The learned embedding achieves near-optimal linear decoding performance after both single-modality and cross-modality learning phases. A) We evaluate our embedding by training a linear decoder $WX+b$ to distinguish embeddings from the correct time frame $t_i$ versus all alternative time frames $t_j$. With $n$ total frames, each correct frame must be distinguished from $n-1$ alternative frames. The chance level for this task is $1/n$ with $n=400$, i.e., 0.25%. B) The linear decoder learns multiple decision boundaries (hyperplanes) that partition the high-dimensional embedding space. In an optimal embedding (top schematic), neural activity from different time points forms distinct clusters, allowing these hyperplanes to create unique partitions for each time frame and enabling perfect linear separability. When embeddings overlap (bottom schematic), these hyperplanes cannot establish unique subspaces for each time point, resulting in decoding errors. C) Decoding performance comparison between the single-modality and cross-modality training phases, using either trial-averaged neural population responses (PSTHs, as held-out test sets) or single-trial activity. The embedding achieves approximately 99% accuracy using PSTH data and 92-93% using single-trial data when decoding frame numbers at 33 ms resolution. Additional decoding results are available in Supplementary Materials \ref{supp:baseline}. Because we have 80,000 samples in our held-out test dataset, the standard errors of the decoding performance here are on the order of 0.1% (not shown).
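The two numbers quoted in the Figure 3 caption, the 0.25% chance level and the roughly 0.1% standard error, can be checked with a few lines of arithmetic. The sketch assumes the test samples are independent Bernoulli trials, which the caption does not state; the helper names are hypothetical.

```python
import math

def chance_level(n_classes):
    """Accuracy of uniform random guessing among n_classes frames."""
    return 1.0 / n_classes

def binomial_standard_error(accuracy, n_samples):
    """Standard error of an accuracy estimate, assuming independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

chance = chance_level(400)                           # 400 frames in the decoding task
se_psth = binomial_standard_error(0.99, 80_000)      # about 99% accuracy (PSTH)
se_single = binomial_standard_error(0.925, 80_000)   # about 92-93% accuracy (single trial)

print(f"chance level: {100 * chance:.2f}%")
print(f"standard error at 99% accuracy:   {100 * se_psth:.3f}%")
print(f"standard error at 92.5% accuracy: {100 * se_single:.3f}%")
```

Under this independence assumption both standard errors come out at or below about a tenth of a percentage point, consistent with the caption's statement that error bars would be too small to display.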
Figure 4: Representational drift affects the encoding of stimulus features differently depending on their spatiotemporal statistics. A) Overall effect of representational drift on stimulus feature decoding. We pretrained our model using neural activity from session 1 (training set) and froze the weights. We then trained linear decoders for four natural features using neural representations from session 1. Next, we generated two sets of neural representations: those without drift, by mapping held-out test data from session 1 into the pretrained model, and those with drift, by mapping neural activity from session 2 into the same model. The dashed lines show decoding performance without drift (approximately 99% for all features, also shown in Fig. \ref{figure:timedecode}), while the bar plots show performance with drift, revealing significant degradation across all features. Note that the standard errors of these accuracies shrink as the sample size grows. Because we use 80,000 samples in both the held-out test set from session 1 (without drift) and the dataset from session 2 (with drift), the standard errors of the decoding performance are negligibly small (on the order of 0.1--0.3%). Consequently, differences in decoding between stimulus features (e.g., Scenery vs Local Motion) are also significant ($p < 0.0001$). B) The drift rates of local motion and scenery features are different. Here we use the decoding error per frame as a proxy for the drift rate. The decoding error per frame is defined as the temporal distance between the frames where the predicted and true features appear (see Supplementary Materials \ref{method:temporal}). The bottom inset illustrates how we compute this temporal distance: the decoded feature (round) is 3 frames (100 ms) away from the correct feature (square); thus, this decoding error has a temporal distance of 3. We plot the fraction of decoding errors with temporal distance $n$ as $n$ changes from 33 ms to 133 ms (1 to 4 frames off). The top inset shows the aggregated decoding errors within three time windows (obtained by summing all frames within those windows): $(0, \tau_1)$ ($\tau_1 = 66$ ms), $(\tau_1, \tau_2)$ ($\tau_2 = 500$ ms), and $(\tau_2, \tau_3)$ ($\tau_3 = 1000$ ms, defined in Results, \ref{sec:movie}).
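The temporal-distance error described in the Figure 4 caption can be sketched as below. The precise procedure is in the paper's supplementary material, which is not reproduced here, so this version makes an assumption: the distance is measured from the true frame to the nearest frame that carries the predicted feature label. Function names and the toy labels are hypothetical.

```python
from collections import Counter

def temporal_distance(frame_labels, true_frame, predicted_label):
    """Frames between true_frame and the nearest frame carrying predicted_label.

    Assumes predicted_label occurs at least once in frame_labels.
    """
    return min(abs(true_frame - t)
               for t, label in enumerate(frame_labels)
               if label == predicted_label)

def error_distance_fractions(frame_labels, true_frames, predicted_labels):
    """Fraction of decoding errors at each temporal distance (in frames)."""
    distances = [temporal_distance(frame_labels, t, p)
                 for t, p in zip(true_frames, predicted_labels)]
    errors = [d for d in distances if d > 0]  # distance 0 means a correct decode
    if not errors:
        return {}
    counts = Counter(errors)
    return {d: c / len(errors) for d, c in sorted(counts.items())}

# Toy example: feature labels for six frames and four decoded samples.
frame_labels = ["a", "b", "c", "d", "a", "b"]
true_frames = [0, 1, 2, 3]
predicted_labels = ["a", "c", "a", "a"]

fractions = error_distance_fractions(frame_labels, true_frames, predicted_labels)
print(fractions)
```

In the toy example one sample is decoded correctly, two errors land one frame away, and one error lands two frames away; multiplying a distance in frames by 33 ms converts it to the time axis used in the figure.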
Figure 5: Representational drift perturbs the local geometry needed for decoding local motion features. A) The learned embedding without representational drift maintains a precise $k$-nearest neighbor structure for temporally adjacent frames within a local neighborhood ($k \in (0,4)$). The cyan line shows the proportion of test samples that maintain a local $k$-nearest neighborhood structure for $k \in (0,4)$. For a neural activity sample occurring at time $t$, its embedding belongs to the $k$-nearest neighborhood if its $k$-th nearest neighbor in the embedding space is a cluster corresponding to a time within $(t-k, t+k)$. At $k=0$, 99% of test samples are correctly classified when using the class means as the linear decoder weights (see Supplementary Materials \ref{appendix:NC}). At $k=1$, 96% of test samples have their second nearest cluster (after their own time cluster) corresponding to either $t+1$ or $t-1$, preserving temporal adjacency. The magenta line shows how this same $k$-nearest neighborhood metric changes in the presence of representational drift, revealing substantial degradation. The dashed line indicates chance-level performance ($0.025$), and the gap between this line and the cyan/magenta curves reflects the preserved global geometric smoothness (see Supplementary Materials \ref{sec:optimalgeometry}). The inset demonstrates that no significant difference exists when expanding the neighborhood beyond $k=4$, indicating that representational drift disrupts only local geometry. B, C) Illustration of local geometry without drift (B) and with drift (C). Without drift, features from different frames (e.g., $t$ and $t+i$) form well-separated clusters with regular shapes and consistent spacing. The $k$-nearest neighbor property ensures that temporally close neighbors (e.g., $t+1$ or $t+2$) are spatially closer than temporally distant ones (gray circles show that the cluster of $t+2$ is approximately twice as far away as the cluster of $t+1$). Panel C shows that representational drift disrupts this organized structure by changing both cluster spreads and inter-cluster distances.
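The $k$-nearest neighborhood metric from the Figure 5 caption can be sketched as follows. The reading adopted here is an assumption: clusters are represented by one class mean per frame, ranks are zero-based so that rank 0 is the nearest cluster, and a sample counts as a hit when the frame of its rank-$k$ cluster lies within $k$ frames of its true time. The one-dimensional toy embedding is purely illustrative.

```python
import math

def knn_neighborhood_fraction(samples, sample_times, class_means, k):
    """Fraction of samples whose rank-k nearest cluster is within k frames of their time.

    class_means[c] is the mean embedding of frame c (one cluster per frame).
    """
    hits = 0
    for x, t in zip(samples, sample_times):
        ranked = sorted(range(len(class_means)),
                        key=lambda c: math.dist(x, class_means[c]))
        if abs(ranked[k] - t) <= k:
            hits += 1
    return hits / len(samples)

# Toy 1-D embedding: frame c has its cluster mean at position c.
class_means = [[0.0], [1.0], [2.0], [3.0]]
sample_times = [0, 2]
stable = [[0.1], [1.9]]    # samples close to their own cluster means
drifted = [[1.3], [3.1]]   # the same samples after a hypothetical shift

for k in (0, 1):
    print(k,
          knn_neighborhood_fraction(stable, sample_times, class_means, k),
          knn_neighborhood_fraction(drifted, sample_times, class_means, k))
```

At $k=0$ the metric reduces to nearest-class-mean classification accuracy, which matches the caption's description of using class means as decoder weights.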
...and 11 more figures

Theorems & Definitions (5)

Lemma 1: Generalized data processing inequality (Cover2006) for Bayes risk (Xu2020; Dubois2021)
Lemma 2: Equivalence of optimal Bayesian risk and support match
Definition 1: Bayesian risk for idealized domain generalization between neural activity and natural movie
Theorem 1: Optimal representation in cross-modality contrastive learning
Definition 2: $K$-Simplex ETF
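Only the name of Definition 2 appears in this listing, so the sketch below uses the standard construction of a simplex equiangular tight frame from the neural collapse literature, which may differ in detail from the paper's own statement: $K$ unit-norm vectors whose pairwise inner products all equal $-1/(K-1)$. The function name `simplex_etf` and the choice to embed the vectors in $K$ dimensions without an extra rotation are assumptions.

```python
import math

def simplex_etf(K):
    """Return K vectors in R^K forming a simplex ETF.

    Standard construction: sqrt(K / (K - 1)) * (I - (1/K) * ones), row by row.
    """
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

vectors = simplex_etf(4)
print(round(inner(vectors[0], vectors[0]), 6))  # squared norm of one vector
print(round(inner(vectors[0], vectors[1]), 6))  # inner product of two distinct vectors
```

Every pair of distinct vectors has the same angle, which is the maximally separated arrangement of $K$ class means and the kind of regular cluster geometry against which a learned embedding can be compared.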
