ReSi: A Comprehensive Benchmark for Representational Similarity Measures

Max Klabunde, Tassilo Wald, Tobias Schumacher, Klaus Maier-Hein, Markus Strohmaier, Florian Lemmerich

TL;DR

ReSi introduces a rigorous, extensible benchmark for representational similarity measures, spanning graph, language, and vision domains with six grounding-based tests, 24 measures, 14 architectures, and seven datasets. It provides a principled evaluation framework distinguishing grounding by prediction and grounding by design, enabling robust comparisons and reproducibility. The findings reveal that no single measure consistently dominates across domains, highlighting domain-specific strengths and the critical role of preprocessing and grounding choices. By making all components public and extensible, ReSi offers a practical platform to develop, compare, and apply representational similarity measures in real neural architectures.

Abstract

Measuring the similarity of different representations of neural architectures is a fundamental task and an open research challenge for the machine learning community. This paper presents the first comprehensive benchmark for evaluating representational similarity measures based on well-defined groundings of similarity. The representational similarity (ReSi) benchmark consists of (i) six carefully designed tests for similarity measures, (ii) 24 similarity measures, (iii) 14 neural network architectures, and (iv) seven datasets, spanning the graph, language, and vision domains. The benchmark opens up several important avenues of research on representational similarity that enable novel explorations and applications of neural architectures. We demonstrate the utility of the ReSi benchmark by conducting experiments on various neural network architectures, real-world datasets, and similarity measures. All components of the benchmark are publicly available and thereby facilitate systematic reproduction and production of research results. The benchmark is extensible, so future research can build on and expand it. We believe that the ReSi benchmark can serve as a sound platform catalyzing future research that aims to systematically evaluate existing and explore novel ways of comparing representations of neural architectures.

Paper Structure (64 sections, 32 equations, 15 figures, 31 tables)

Figures (15)

  • Figure 1: Grounding similarity. In all tests within the ReSi benchmark, we design a set of models for which we can establish a ground truth about the similarity of their representations. The left heatmap illustrates the true similarity between a set of models; the other heatmaps show the similarity values that different similarity measures assign to each model pair via their representations. We rank similarity measures by their ability to capture the ground truth. In practice, a ground-truth similarity between models is usually hard to attain. For the ReSi benchmark, we design tests where similarity is practically grounded.
  • Figure 2: Illustration of grounding approaches. We consider two approaches to establish ground truths for representational similarity. When grounding by prediction, we evaluate whether differences in representation matrices correspond to differences in the predictions of models, as measured, for instance, by Jensen-Shannon divergence (JSD). Ideally, a representational similarity measure $m$ perfectly correlates with output similarity. When grounding by design, we design groups of models that are similar within and dissimilar across groups. A representational similarity measure $m$ should distinguish these groups accordingly.
  • Figure 3: Aggregated ranks of measures across all models and tests, separated by domain. Lower is better. Measures are ordered by their median rank and categorized according to the taxonomy by Klabunde et al. (2023). Tied measures all receive the best rank of their group. Boxplots indicate quartiles of the rank distributions; the whiskers extend up to 1.5 times the interquartile range. No single measure or category stands out across all domains.
  • Figure 4: Validation accuracies of GNN models trained with standard parametrization. We used the standard train/validation/test splits provided with the given datasets. Test accuracies largely correspond to those obtained in common benchmarks.
  • Figure 5: Validation accuracies of GNN models trained for Test 3 (Label Randomization). Accuracies were computed on test sets with regular labels. With increasing degree of randomization of target labels, performance degraded strongly. Clusters of accuracies per group are clearly separated.
  • ...and 10 more figures
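
The grounding-by-prediction idea described in the Figure 2 caption can be illustrated with a small sketch. This is not the benchmark's own code: the function names, the toy data, and the choice of linear CKA as the representational similarity measure $m$ are assumptions made purely for illustration. The sketch compares several perturbed toy "models" against a base model, computes the JSD between their output distributions and the linear CKA between their representation matrices, and then checks how well the two quantities are rank-correlated.

```python
import numpy as np


def jensen_shannon_divergence(p, q, eps=1e-12):
    """Mean JSD (natural log) between two arrays of per-instance class probabilities."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=1)
    kl_qm = np.sum(q * np.log(q / m), axis=1)
    return float(np.mean(0.5 * (kl_pm + kl_qm)))


def linear_cka(x, y):
    """Linear CKA between two representation matrices (rows are instances)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def toy_grounding_by_prediction(seed=0):
    """Hypothetical toy setup: perturbed copies of one representation, shared linear head."""
    rng = np.random.default_rng(seed)
    base_rep = rng.normal(size=(200, 16))   # representation of the base model
    head = rng.normal(size=(16, 3))         # shared classification head (assumption)
    noise = rng.normal(size=base_rep.shape)
    base_out = softmax(base_rep @ head)

    rep_sims, out_divs = [], []
    for scale in (0.5, 1.0, 2.0, 4.0):
        rep = base_rep + scale * noise      # increasingly perturbed "model"
        rep_sims.append(linear_cka(base_rep, rep))
        out_divs.append(jensen_shannon_divergence(base_out, softmax(rep @ head)))

    # A measure that tracks prediction differences should give high similarity
    # where JSD is low, i.e. a strongly negative rank correlation.
    return spearman(rep_sims, out_divs)


if __name__ == "__main__":
    print("Spearman(CKA, JSD):", toy_grounding_by_prediction())
```

In this toy setting, a measure that captures the grounding well would yield a rank correlation close to -1 between representational similarity and output divergence. For real experiments with tied values, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.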