How Aligned are Different Alignment Metrics?

Jannis Ahlert, Thomas Klein, Felix Wichmann, Robert Geirhos

TL;DR

The paper addresses whether diverse alignment metrics cohere in capturing brain-like representations in artificial systems. By analyzing Brain-Score alongside additional human-alignment datasets and computing pairwise Spearman correlations across up to 241 models and 50 metrics, it finds that neural and behavioral metrics often diverge, with an average cross-metric correlation as low as $0.198$ in some comparisons, highlighting a multidimensional view of alignment. It then evaluates aggregation strategies (arithmetic mean, z-transform, mean rank) and shows that aggregation choices can substantially shift model rankings, underscoring the need for principled integration of metrics. The work argues for more integrative, axiomatic benchmarking that respects the distinct dimensions captured by different metrics, which has important implications for evaluating and improving brain-like perception in artificial systems.

Abstract

In recent years, various methods and benchmarks have been proposed to empirically evaluate the alignment of artificial neural networks to human neural and behavioral data. But how aligned are different alignment metrics? To answer this question, we analyze visual data from Brain-Score (Schrimpf et al., 2018), including metrics from the model-vs-human toolbox (Geirhos et al., 2021), together with human feature alignment (Linsley et al., 2018; Fel et al., 2022) and human similarity judgements (Muttenthaler et al., 2022). We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative. For instance, the average correlation between those 80 models on Brain-Score that were fully evaluated on all 69 alignment metrics we considered is only 0.198. Assuming that all of the employed metrics are sound, this implies that alignment with human perception may best be thought of as a multidimensional concept, with different methods measuring fundamentally different aspects. Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics. Aggregating by taking the arithmetic average, as done in Brain-Score, leads to the overall performance currently being dominated by behavior (95.25% explained variance) while the neural predictivity plays a less important role (only 33.33% explained variance). As a first step towards making sure that different alignment metrics all contribute fairly towards an integrative benchmark score, we therefore conclude by comparing three different aggregation options.
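
The pairwise comparison described in the abstract can be illustrated with a minimal sketch. It assumes each metric is a list of per-model scores in the same model order, computes Spearman's rank correlation for every pair of metrics, and averages the results. The metric names and scores below are hypothetical toy values, not data from the paper, and the helper functions are illustrative rather than the authors' implementation.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(scores):
    """scores maps a metric name to per-model scores (same model order)."""
    return {(a, b): spearman(scores[a], scores[b]) for a, b in combinations(scores, 2)}


# Hypothetical toy scores for five models on three made-up metrics.
toy_scores = {
    "neural_a": [0.30, 0.35, 0.32, 0.40, 0.28],
    "neural_b": [0.50, 0.45, 0.55, 0.48, 0.52],
    "behavior": [0.10, 0.60, 0.30, 0.70, 0.20],
}
correlations = pairwise_spearman(toy_scores)
average_correlation = mean(list(correlations.values()))
```

In this toy example, two metrics can agree closely while a third disagrees with both, which lowers the average. The abstract reports the same pattern at scale.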

Paper Structure (15 sections, 7 figures)

Figures (7)

  • Figure 1: How aligned are different alignment metrics? Pairwise Spearman's rank correlations of different Brain-Score metrics, as well as odd-one-out similarity judgements and attention map similarity. Correlations that are significant after Bonferroni correction are shown in bold. We include only those $42$ models that were evaluated on all metrics. Note that (a) variance of correlation coefficients is quite high and (b) similar metrics tend to agree, with the exception of the odd-one-out similarities. See also \ref{fig:appdx_pairwise_correlations} for the $80$ Brain-Score models that have all scores on the Brain-Score metrics of this heatmap.
  • Figure 2: Left: Relationship between neural and behavioral scores. Coloring represents the model rank in Brain-Score, which is the average of the neural and behavioral scores (brighter color indicates higher overall score). Right: Same data after z-transforming both scores. The dominance of behavioral scores prevails, because there are extreme outliers in the neural scores.
  • Figure 3: Comparison of rankings resulting from different integration schemes. We compare three different ranking algorithms: (1) Arithmetic mean corresponds to the current standard in Brain-Score, where scores are simply averaged. (2) The z-transformed ranking is obtained by z-transforming scores first, then averaging them. (3) Mean Rank is the result of averaging the ranks implied by the metrics (ranks are inverted for consistency, so higher is still better). All scores are normalized to the interval $[0, 1]$ using Min-Max Normalization and colored according to a model's position under the original integration scheme. Rank-order changes greater than $10$ positions (relative to the rank implied by the arithmetic mean) are highlighted in red. Spearman's rank correlations between the original ranking and the new ones are 0.47 (mean rank) and 0.92 (z-transformed mean).
  • Figure 4: How consistent are Brain-Score metrics? Pairwise Spearman's rank correlations of different metrics, from V1 to IT and finally to behavioral measures. Note how the agreement between the behavioral metrics is much higher than the agreement of the neural measures.
  • Figure 5: Pairwise Spearman's rank correlations of Brain-Score metrics, from V1 to IT and behavioral measures. In contrast to \ref{fig:brainscore_correlations}, we include all Brain-Score models that had a value available for each metric, not only the subset that was also evaluated on the metrics by Muttenthaler et al. (2022) and Fel et al. (2022). This amounts to a total of $80$ models. The average correlation on this heatmap is 0.17, while the average correlation of the smaller set of models in \ref{fig:brainscore_correlations} is 0.21, exemplifying the need for further thorough evaluations.
  • ...and 2 more figures
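
The three integration schemes named in the Figure 3 caption (arithmetic mean, z-transformed mean, mean rank, each followed by Min-Max Normalization) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the z-transform uses the population standard deviation, ranks ignore ties, and the two score lists are made-up toy values.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def ascending_ranks(values):
    """Rank 1 = lowest score, so a higher rank still means a better model.

    Ties are not handled specially in this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(metric_scores, transform=None):
    """Average per-model scores across metrics after an optional per-metric transform."""
    columns = [transform(col) if transform else col for col in metric_scores]
    return min_max([mean(per_model) for per_model in zip(*columns)])


# Hypothetical toy data: one score per model for each of two metrics.
neural = [0.30, 0.31, 0.32, 0.29]     # narrow spread across models
behavior = [0.10, 0.80, 0.20, 0.60]   # wide spread across models

arithmetic = aggregate([neural, behavior])
z_mean = aggregate([neural, behavior], transform=z_transform)
mean_rank = aggregate([neural, behavior], transform=ascending_ranks)
```

With these toy values, the metric with the wider spread dominates the arithmetic mean, whereas standardizing or ranking each metric first lets the narrow-spread metric influence the final ordering.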