Differentiable Optimization of Similarity Scores Between Models and Brains

Nathan Cloos, Moufan Li, Markus Siegel, Scott L. Brincat, Earl K. Miller, Guangyu Robert Yang, Christopher J. Cueva

TL;DR

The paper tackles the interpretability of representational similarity metrics used to compare models and brains by differentiating through these measures to maximize similarity. It introduces a differentiable optimization framework that yields synthetic datasets Y aligned with neural data X under various metrics (CKA, angular CKA, angular Procrustes, NBS, and regression-based scores) and assesses whether high similarity equates to task-relevant encoding. The study reveals that the meaning of a ‘good’ score is metric- and dataset-dependent, and that high similarity does not guarantee neural-consistent encoding, with CKA biased toward high-variance components. The authors also derive theoretical relationships showing CKA’s quadratic dependence on high-variance PCs versus NBS’s linear dependence, and demonstrate how jointly optimizing multiple metrics defines feasible score ranges, underscoring the need for careful interpretation and providing open-source tooling for standardization. Together, these findings offer a more nuanced framework for using similarity measures in neuroscience and AI, and tools to benchmark and interpret future metrics.

Abstract

How do we know if two systems - biological or artificial - process information in a similar way? Similarity measures such as linear regression, Centered Kernel Alignment (CKA), Normalized Bures Similarity (NBS), and angular Procrustes distance, are often used to quantify this similarity. However, it is currently unclear what drives high similarity scores and even what constitutes a "good" score. Here, we introduce a novel tool to investigate these questions by differentiating through similarity measures to directly maximize the score. Surprisingly, we find that high similarity scores do not guarantee encoding task-relevant information in a manner consistent with neural data; and this is particularly acute for CKA and even some variations of cross-validated and regularized linear regression. We find no consistent threshold for a good similarity score - it depends on both the measure and the dataset. In addition, synthetic datasets optimized to maximize similarity scores initially learn the highest variance principal component of the target dataset, but some methods like angular Procrustes capture lower variance dimensions much earlier than methods like CKA. To shed light on this, we mathematically derive the sensitivity of CKA, angular Procrustes, and NBS to the variance of principal component dimensions, and explain the emphasis CKA places on high variance components. Finally, by jointly optimizing multiple similarity measures, we characterize their allowable ranges and reveal that some similarity measures are more constraining than others. While current measures offer a seemingly straightforward way to quantify the similarity between neural systems, our work underscores the need for careful interpretation. We hope the tools we developed will be used by practitioners to better understand current and future similarity measures.
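The core idea in the abstract, treating the similarity score as an objective and following its gradient with respect to a synthetic dataset, can be sketched for one measure. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it uses linear CKA only, a hand-derived gradient in place of automatic differentiation, and a simple backtracking step rule. The function names (`linear_cka`, `linear_cka_grad`, `maximize_cka`) and all hyperparameters are hypothetical.

```python
import numpy as np


def center(A):
    """Subtract the mean over samples (rows) from each feature (column)."""
    return A - A.mean(axis=0, keepdims=True)


def linear_cka(X, Y):
    """Linear CKA between X (n x p) and Y (n x q), both centered internally."""
    Xc, Yc = center(X), center(Y)
    cross = np.linalg.norm(Xc.T @ Yc, "fro") ** 2
    return cross / (np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro"))


def linear_cka_grad(X, Y):
    """Hand-derived gradient of linear_cka(X, Y) with respect to Y.

    With K = Xc Xc^T, L = Yc Yc^T, a = tr(KL), b = ||K||_F, c = ||L||_F:
        grad = 2 / (b c) * (K Yc - (a / c^2) L Yc)
    """
    Xc, Yc = center(X), center(Y)
    a = np.linalg.norm(Xc.T @ Yc, "fro") ** 2
    b = np.linalg.norm(Xc.T @ Xc, "fro")
    c = np.linalg.norm(Yc.T @ Yc, "fro")
    KY = Xc @ (Xc.T @ Yc)  # K Yc without forming the n x n matrix K
    LY = Yc @ (Yc.T @ Yc)  # L Yc without forming the n x n matrix L
    return 2.0 / (b * c) * (KY - (a / c**2) * LY)


def maximize_cka(X, Y0, steps=500, lr=1.0):
    """Gradient ascent on linear CKA, starting from Y0 (assumed settings).

    CKA is invariant to rescaling Y, so Y is kept at unit Frobenius norm.
    A step is accepted only if the score improves; otherwise lr is halved.
    Returns the optimized dataset and the score after every iteration.
    """
    Y = center(Y0)
    Y = Y / np.linalg.norm(Y)
    score = linear_cka(X, Y)
    history = [score]
    for _ in range(steps):
        candidate = Y + lr * linear_cka_grad(X, Y)
        candidate = candidate / np.linalg.norm(candidate)
        new_score = linear_cka(X, candidate)
        if new_score > score:
            Y, score = candidate, new_score
        else:
            lr *= 0.5
        history.append(score)
    return Y, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in "reference" data with unequal variance per dimension (not neural data).
    X = rng.normal(size=(60, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
    Y0 = rng.normal(size=(60, 4))  # Gaussian noise initialization
    Y, history = maximize_cka(X, Y0)
    print(f"initial CKA: {history[0]:.3f}, final CKA: {history[-1]:.3f}")
```

Recording the score at every step is what makes it possible to ask, at each similarity level, which properties of the reference data the synthetic dataset has already picked up. An autodiff framework would replace `linear_cka_grad` and extend the same loop to the other measures.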

Paper Structure (22 sections, 23 equations, 12 figures)

Figures (12)

  • Figure 1: (a) To better understand the properties of similarity measures, we optimize synthetic datasets to become more similar to a reference dataset, for example, neural recordings. (b) We analyzed similarity scores between artificial datasets and electrode recordings from five experiments on nonhuman primates spanning a diverse range of behaviors and brain regions.
  • Figure 2: Different similarity measures do not agree on the relative rankings when comparing models to neural datasets. One example application of similarity measures is to evaluate the similarity of task-optimized recurrent neural networks to neural datasets. We consider two neural datasets from (a) prefrontal cortex (PFC) Mante2013 and (b) Frontal Eye Field (FEF) Siegel2015 in monkeys performing an experimental task that required the animal to attend to either color or motion information while ignoring the non-cued feature of the stimuli. (c, d) RNNs with three different architectures (CTRNN, LowPassCTRNN, LSTM) and three different nonlinearities (ReLU, ReTanh, Tanh) are compared to the neural datasets (see the appendix on models and datasets for details).
  • Figure 3: What constitutes a good score varies depending on the similarity measure and the dataset. Decode accuracy for experimental variables versus similarity scores. The experimental variables are color vs motion contexts (binary variable) for Mante 2013 and Siegel 2015, reaching direction (total of 8 directions) for Hatsopoulos 2007, object categories (total of 8 categories) for MajajHong 2015, and texture vs noise categories (binary variable) for FreemanZiemba 2013. Horizontal dashed lines show the decode accuracy from the neural data (upper line) and chance level (lower line). Colored dots above the x-axis indicate the similarity scores when the decode accuracy reaches 90% midway between chance level and the decode accuracy from the reference neural dataset.
  • Figure 4: Different similarity measures differentially prioritize learning principal components of the data. (a) Reference dataset used as a target during optimization. (b, c) Initial Gaussian random noise data is updated to maximize similarity with the reference dataset, as quantified by one of the similarity measures. The transformation of the random noise dataset is shown at the top of panel b. The first principal component of the reference dataset is increasingly well captured by the optimized data as the similarity scores increase (yellow curves). The second, lower variance, component is also learned when maximizing the angular Procrustes similarity but is only captured at high similarity scores when maximizing linear regression, CKA, and angular CKA similarity. (d) Four reference datasets with decreasing variance along the second principal component. (e) Similarity measures capture both principal components when their variance is approximately equal. However, when the variance differs, CKA and linear regression preferentially neglect the low variance component (curves colored according to asymmetry of variance distribution).
  • Figure 5: (a) A randomly initialized synthetic dataset is updated to maximize the similarity with a neural dataset, taken here to be the FEF dataset from Siegel2015. The principal components (PCs) of this reference dataset are captured by the optimized dataset at different similarity scores, which in subsequent figures we call the score to reach the PC threshold. (b) The score to reach the PC threshold for the Mante2013 dataset is shown as a function of the variance explained by each PC. The highest variance PC is learned first during optimization at the lowest similarity score (bottom right of figure). A vertical slice through the figure shows the similarity score required to capture a specific PC. For example, capturing the PC at $10^{-2}$ requires a much lower similarity score when maximizing angular Procrustes versus CKA (light blue curve is below the red curve). (c, d) The reference dataset used as a target during optimization is the neural activity from Siegel2015 FEF (panel c) and MajajHong2015 (panel d). (e) The neural data points are the same as in panels c and d (colored dots). The similarity scores at which PCs of this neural activity are learned are well predicted by replacing neural activity with random Gaussian datasets that have a matching distribution of variances for each PC (black curves).
  • ...and 7 more figures
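
The ordering described in Figures 4 and 5, where CKA picks up low-variance PCs later than angular Procrustes, can be illustrated with a simple special case. This is a worked example under stated assumptions rather than the paper's general derivation: it uses the standard definitions of linear CKA and NBS, takes $X$ to be centered with PC variances $\lambda_1 \ge \dots \ge \lambda_p$, and takes $Y_k$ to be $X$ projected onto its top $k$ PCs. The symbols $\lambda_i$, $Y_k$, and $\theta$ are introduced here for illustration.

```latex
% Standard definitions (left) and their values when Y = Y_k keeps only the top k PCs (right).
\begin{align}
\mathrm{CKA}(X, Y) &= \frac{\|X^\top Y\|_F^2}{\|X^\top X\|_F \, \|Y^\top Y\|_F},
&
\mathrm{CKA}(X, Y_k) &= \sqrt{\frac{\sum_{i \le k} \lambda_i^2}{\sum_{i \le p} \lambda_i^2}}, \\
\mathrm{NBS}(X, Y) &= \frac{\|X^\top Y\|_*}{\|X\|_F \, \|Y\|_F},
&
\mathrm{NBS}(X, Y_k) &= \sqrt{\frac{\sum_{i \le k} \lambda_i}{\sum_{i \le p} \lambda_i}}, \\
\theta(X, Y) &= \arccos \mathrm{NBS}(X, Y)
&& \text{(angular Procrustes distance)}.
\end{align}
```

In this special case the CKA score depends on the squared variances while NBS depends on the variances themselves, so omitting a low-variance PC costs CKA much less. For example, with two PCs of variance $1$ and $0.1$, keeping only the first gives a CKA of $\sqrt{1/1.01} \approx 0.995$ but an NBS of $\sqrt{1/1.1} \approx 0.953$.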