Bayesian Comparisons Between Representations

Heiko H. Schütt

TL;DR

The paper introduces a Bayesian framework for comparing neural representations by analyzing the predictive distributions of linear readouts from intermediate representations. It derives analytically tractable prior predictive distributions under Gaussian priors and noise. Distances between these predictive distributions, specifically the Jensen-Shannon distance and the total variation distance, serve as pseudo-metrics on representations and connect to kernel-based measures via the linear kernel $XX^T$. The authors demonstrate the approach on networks trained on ImageNet-1k and on a small subset of the Natural Scenes Dataset, showing that Bayesian distances capture meaningful representational structure, exhibit informative uncertainty, and correlate with, yet remain distinct from, existing metrics like RSA and CKA. They also provide stability analyses and neural-data evaluations, illustrating practical advantages for model comparison and stimulus-design considerations in neuroscience-inspired machine learning. Overall, the work offers a principled, uncertainty-aware toolkit for representational comparison that complements and extends current kernel- and probe-based approaches.

Abstract

Which neural networks are similar is a fundamental question for both machine learning and neuroscience. Here, it is proposed to base comparisons on the predictive distributions of linear readouts from intermediate representations. In Bayesian statistics, the prior predictive distribution is a full description of the inductive bias and generalization of a model, making it a great basis for comparisons. This distribution directly gives the evidence a dataset would provide in favor of the model. If we want to compare multiple models to each other, we can use a metric for probability distributions like the Jensen-Shannon distance or the total variation distance. As these are metrics, this induces pseudo-metrics for representations, which measure how well two representations could be distinguished based on a linear read out. For a linear readout with a Gaussian prior on the read-out weights and Gaussian noise, we can analytically compute the (prior and posterior) predictive distributions without approximations. These distributions depend only on the linear kernel matrix of the representations in the model. Thus, the Bayesian metrics connect to both linear read-out based comparisons and kernel based metrics like centered kernel alignment and representational similarity analysis. The new methods are demonstrated with deep neural networks trained on ImageNet-1k comparing them to each other and a small subset of the Natural Scenes Dataset. The Bayesian comparisons are correlated to but distinct from existing metrics. Evaluations vary slightly less across random image samples and yield informative results with full uncertainty information. Thus the proposed Bayesian metrics nicely extend our toolkit for comparing representations.
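The central construction in the abstract can be sketched in a few lines. With a zero-mean Gaussian prior on the readout weights and Gaussian noise, the prior predictive distribution over the outputs for a set of stimuli is itself a zero-mean Gaussian whose covariance depends on the representation $X$ only through the linear kernel $XX^T$. The sketch below is illustrative rather than the paper's implementation: the function names, the parameters `weight_var` and `noise_var`, and the toy data are assumptions, and the Jensen-Shannon distance is estimated by simple Monte Carlo sampling, which may differ from how the paper evaluates it.

```python
import numpy as np

def prior_predictive_cov(X, weight_var=1.0, noise_var=1.0):
    """Covariance of y = X w + eps, with w ~ N(0, weight_var * I)
    and eps ~ N(0, noise_var * I). Depends on X only via X @ X.T."""
    n = X.shape[0]
    return weight_var * X @ X.T + noise_var * np.eye(n)

def gaussian_logpdf(y, cov):
    """Log density of the zero-mean Gaussian N(0, cov) at each row of y."""
    n = cov.shape[0]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, y.T)              # whitened samples, shape (n, samples)
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (np.sum(z**2, axis=0) + logdet + n * np.log(2.0 * np.pi))

def js_distance_mc(cov_a, cov_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Jensen-Shannon distance (square root of
    the JS divergence, in nats) between N(0, cov_a) and N(0, cov_b)."""
    rng = np.random.default_rng(seed)
    n = cov_a.shape[0]

    def kl_to_mixture(cov_p, cov_q):
        # Estimate KL(p || m) with m = (p + q) / 2, using samples from p.
        y = rng.multivariate_normal(np.zeros(n), cov_p, size=n_samples)
        log_p = gaussian_logpdf(y, cov_p)
        log_q = gaussian_logpdf(y, cov_q)
        log_m = np.logaddexp(log_p, log_q) - np.log(2.0)
        return np.mean(log_p - log_m)

    jsd = 0.5 * kl_to_mixture(cov_a, cov_b) + 0.5 * kl_to_mixture(cov_b, cov_a)
    return np.sqrt(max(jsd, 0.0))

# Toy data (hypothetical): 5 stimuli, 3 features per representation.
rng = np.random.default_rng(1)
X_a = rng.normal(size=(5, 3))
X_b = rng.normal(size=(5, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal matrix

d_ab = js_distance_mc(prior_predictive_cov(X_a), prior_predictive_cov(X_b))
d_rot = js_distance_mc(prior_predictive_cov(X_a), prior_predictive_cov(X_a @ Q))
print("distance between different representations:", d_ab)
print("distance to a rotated copy:", d_rot)
```

Rotating a representation leaves $XX^T$ unchanged, so a rotated copy cannot be distinguished by any linear readout under this prior. This is why the induced distance is a pseudo-metric rather than a metric on representations.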

Paper Structure

This paper contains 40 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Minimal example for the Bayesian comparison framework: two stimuli in two 2D representations. A: The original representations of the two stimuli. B: Predictive distributions induced by a linear readout model with a zero-mean Gaussian weight prior. These are the distributions we compare to determine the (dis-)similarity of representations. C: Prediction for Stimulus 2 according to the two models if a value of 1 for Stimulus 1 is given as training data. In Bayesian statistics, these are computed by conditioning the distribution in B, corresponding to the cut at the dashed line through a value of 1 for Stimulus 1.
  • Figure 2: Dependence of the Jensen-Shannon Divergence (JSD) on the number of images used and the signal-to-noise ratio for two comparisons within a standard ResNet-18. For the signal-to-noise ratio, two labels are shown: the noise variance for signal variance 1 and the mixture factor $a$ as defined in the text. The red line shows the slope such that the noise variance is proportional to the number of images. Note that JSD is fairly constant along this line once enough images are collected, while JSD gets small for few images independent of the signal-to-noise ratio. Left: Divergence between two close representations: the outputs of the first layer and the output of the first block of the second layer. Right: Divergence between two different representations: the first block in the first layer and the last block in the last convolution layer.
  • Figure 3: Example analysis based on the Jensen-Shannon Distance as proposed here. A range of layers from AlexNet, ResNet-18 and the Vision Transformer B-16 (ViT-B-16) are compared based on 200 randomly chosen natural images. Weights for all networks were obtained from torchvision and were originally trained on ImageNet-1k. A: Distance matrix according to the Jensen-Shannon Distance including the pre-processed input images. B: Metric MDS embedding of the layers into a 2D space with arbitrary units.
  • Figure 4: Comparisons between metrics based on all pairwise distances between layers of AlexNet and ResNet-18 using 100 random unlabeled images from MS COCO as inputs. A: Plotting different dissimilarity measures against the total variation distance proposed here. The number above each plot gives the Pearson correlation $\pm$ the coarse analytic estimate of the standard deviation $(1-r^2)/\sqrt{N-3}$ (gnambs_brief_2023). B: The matrix of pairwise comparisons between layers according to the different dissimilarity measures. Compared measures: Jensen-Shannon Distance and Total Variation Distance as proposed here. Centered Kernel Alignment: One minus the linear centered kernel alignment. Generalized Shape Metric: $\arccos$ of the centered kernel alignment, which is a shape metric (williams_generalized_2021). RSA 1-correlation: Representational similarity analysis based on one minus the Pearson correlation of Euclidean distances. RSA arc-cosine: Representational similarity analysis based on the $\arccos$ of the cosine similarity of Euclidean distances.
  • Figure 5: Evaluation of AlexNet and ResNet-18 layers at predicting fMRI voxels measured in left PPA for the Natural Scenes Dataset (allen_massive_2021) as prepared for the Algonauts challenge 2023 (gifford_algonauts_2023), using the first 500 images as training and the next 500 for evaluation. A: Evaluation of the prior prediction on the training data. Top shows the number of voxels that each of the layers performed best for. Bottom shows the distribution of log-posterior values for each layer, which is proportional to the evidence in favor of each layer. Clearly some of the layers could be excluded based on this data. B: Comparison of the posterior TVD (based on the distribution conditioned on the training data) to the prior TVD (based on the prior weight distribution) for the first voxel of PPA in the dataset. The biggest changes are for layers that perform badly, which become more similar to the ones that perform well, but some layers that perform similarly become more different. C: Same data as in B, but plotting the two TVDs against each other.
  • ...and 1 more figure
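The conditioning step described in the Figure 1 caption (panel C) follows from the standard formula for conditioning a joint Gaussian. The sketch below is a minimal illustration of it, not the paper's code. The function name, the parameters `weight_var` and `noise_var`, and the two toy representations are hypothetical, chosen for simplicity and not taken from the figure. The prediction is for the noisy readout value of the held-out stimulus.

```python
import numpy as np

def conditional_prediction(X, y_train, train_idx, test_idx,
                           weight_var=1.0, noise_var=0.1):
    """Condition the zero-mean Gaussian prior predictive
    N(0, weight_var * X X^T + noise_var * I) on observed values y_train
    at train_idx; return mean and covariance at test_idx."""
    K = weight_var * X @ X.T + noise_var * np.eye(X.shape[0])
    K_tt = K[np.ix_(train_idx, train_idx)]      # training block
    K_st = K[np.ix_(test_idx, train_idx)]       # test/training cross block
    K_ss = K[np.ix_(test_idx, test_idx)]        # test block
    gain = np.linalg.solve(K_tt, K_st.T).T      # equals K_st @ inv(K_tt)
    mean = gain @ y_train
    cov = K_ss - gain @ K_st.T
    return mean, cov

# Two hypothetical 2D representations of two stimuli (one stimulus per row).
X_same = np.array([[1.0, 0.0], [1.0, 0.0]])     # stimuli represented identically
X_orth = np.array([[1.0, 0.0], [0.0, 1.0]])     # stimuli represented orthogonally

y1 = np.array([1.0])                            # observed value for Stimulus 1
mean_same, cov_same = conditional_prediction(X_same, y1, [0], [1])
mean_orth, cov_orth = conditional_prediction(X_orth, y1, [0], [1])
print("identical representations:", mean_same, cov_same)
print("orthogonal representations:", mean_orth, cov_orth)
```

When the two stimuli share a representation, the observation for Stimulus 1 pulls the prediction for Stimulus 2 toward 1 and shrinks its variance. When the representations are orthogonal, the observation carries no information and the prediction stays at the prior.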