Representational Difference Explanations

Neehar Kondapaneni, Oisin Mac Aodha, Pietro Perona

TL;DR

Representational Difference Explanations (RDX) introduces a training-free, difference-centric framework to contrast two model representations and visualize where they disagree. By constructing neighborhood-based distance matrices, applying a locally biased difference function, and sampling difference explanations via spectral clustering (with optional alignment via centered kernel alignment), RDX yields interpretable concept grids that highlight model-specific distinctions. Empirical results show RDX reliably recovers known differences and uncovers previously unknown ones across MNIST-inspired tasks and large vision models on ImageNet/iNaturalist, outperforming dictionary-learning XAI baselines on the primary metric (Binary Success Rate) and related semantics metrics. The method offers a practical tool for model comparison with broad applicability, while acknowledging limitations in scalability, distance assumptions, and potential biases in external evaluators; future work may extend RDX to text and multimodal representations and integrate supervised cues for enhanced interpretability.
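The centered kernel alignment (CKA) score mentioned above can be sketched in a few lines. This is a minimal illustration of the standard linear variant of CKA, not the authors' alignment procedure, which this summary does not describe. The function names and the toy matrices are assumptions made for the example.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of x^T y for row-major lists x (n x p) and y (n x q)."""
    total = 0.0
    for col_x in zip(*x):
        for col_y in zip(*y):
            dot = sum(a * b for a, b in zip(col_x, col_y))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


# Hypothetical embeddings of the same four items under two models.
emb_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [2.0, 0.0]]
# A rotated (90 degrees) and rescaled copy of emb_a: same geometry, different coordinates.
emb_rotated = [[-3.0 * b, 3.0 * a] for a, b in emb_a]
# An unrelated embedding of the same four items.
emb_other = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(linear_cka(emb_a, emb_rotated))  # close to 1: CKA ignores rotation and isotropic scaling
print(linear_cka(emb_a, emb_other))    # strictly below 1
```

Because linear CKA is invariant to rotation and isotropic scaling, a score near 1 indicates that two representations share the same geometry even when their coordinates differ.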

Abstract

We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Difference Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data. Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.

Paper Structure

This paper contains 46 sections, 11 equations, 21 figures, 11 tables, 1 algorithm.

Figures (21)

  • Figure 1: Intuition behind our method. Representational Difference Explanations (RDX) aim to highlight the substantive differences between two representations (e.g., ${\bm{A}}$ and ${\bm{B}}$, which are the embedding matrices produced by two different models for the same set of data). Here ${\bm{A}}$ supports discrimination between circles and squares, whereas ${\bm{B}}$ does not. Clustering the two representations independently would not reveal the square/circle sub-structure unique to ${\bm{A}}$. By "subtracting" ${\bm{B}}$ from ${\bm{A}}$, RDX reveals which items are considered similar in ${\bm{A}}$, but not in ${\bm{B}}$. RDX isolates differences and ignores data that may be equally well grouped in both representations, such as the triangles and diamonds.
  • Figure 2: Comparing RDX to NMF. We train a small CNN on a modified MNIST dataset that only contains images of the digits 3, 5, and 8. We compare a strong model checkpoint representation (${\bm{M}}_S$, 95% accuracy) with a final 'expert' model representation (${\bm{M}}_E$, 98% accuracy). The left and middle columns show PCA projections of the ${\bm{M}}_S$ and ${\bm{M}}_E$ representations, respectively. The transparent colors indicate classes in the dataset: 3 (light-blue), 5 (light-orange), and 8 (light-green). The rightmost column visualizes the images selected by the explanation methods. We extract three concepts for each method. (A) We generate explanations using NMF \cite{fel2023craft} with maximum sampling \cite{fel2023holistic, fel2023craft, konda2025rsvc} for ${\bm{M}}_S$ and ${\bm{M}}_E$. Bold colored points on the PCA plots indicate the location of the sampled images seen in the rightmost column. We find that NMF is unable to reveal any representational difference between ${\bm{M}}_S$ and ${\bm{M}}_E$ because it produces indistinguishable explanations for both models. (B) In contrast, RDX discovers concepts unique to ${\bm{M}}_S$ by identifying images that are more similar in ${\bm{M}}_S$ than in ${\bm{M}}_E$. The sampled points are overlaid on both models' representations and show tight clusters in ${\bm{M}}_S$ that contrast with diffuse points in ${\bm{M}}_E$. The right column shows the corresponding explanations, highlighting how model representations differ.
  • Figure 3: Binary success rate evaluation of XAI methods. For each XAI method, we compute the binary success rate (BSR) (\ref{sec:eval_expl}) on all difference experiments, where higher is better. We use neighborhood distances to measure BSR (\ref{sec:norm_dists}). Each method (x-axis) is assigned a different color; we show $\mathtt{BSR}(\mathcal{E}^{\bm{A}})$ (darker box) and $\mathtt{BSR}(\mathcal{E}^{\bm{B}})$ (lighter box). (A) We show results on the MNIST and CUB PCBM experiments (\ref{sec:known_diff}), in which we modify a representation and test if RDX can help identify the modification. (B) We show results when comparing large vision models with unknown differences (\ref{sec:unknown_diff}). We compare recovering differences without (left) and with (right) an initial alignment step (\ref{sec:rep_align}). In all cases, our RDX approach consistently outperforms the dictionary learning baselines. A complete set of results is available in \ref{tab:metrics_bsr_table}.
  • Figure 4: Recovering vertical flip modifications on MNIST. We visualize explanations produced by three XAI methods, $\mathtt{RDX}$, KMeans, and NMF, to compare models ${\bm{M}}_{\updownarrow}$ and ${\bm{M}}_{\uparrow \downarrow}$. Both models are trained on a dataset with vertically flipped and normal images. ${\bm{M}}_{\updownarrow}$ is trained to associate the original label with flipped digits and ${\bm{M}}_{\uparrow \downarrow}$ is trained to predict a new set of labels for flipped digits. We expect ${\bm{M}}_{\updownarrow}$ to mix flipped and unflipped digits while ${\bm{M}}_{\uparrow \downarrow}$ should separate them. We generate three explanations for each method. (Left, Middle) KMeans and NMF generate explanations that are difficult to understand. (Right) RDX captures the expected difference. $\mathtt{RDX}({\bm{M}}_{\updownarrow}, {\bm{M}}_{\uparrow \downarrow})$ reveals that ${\bm{M}}_{\updownarrow}$ represents flipped and unflipped 6s, 7s, and 9s closer together than in ${\bm{M}}_{\uparrow \downarrow}$. $\mathtt{RDX}({\bm{M}}_{\uparrow \downarrow}, {\bm{M}}_{\updownarrow})$ shows that ${\bm{M}}_{\uparrow \downarrow}$ has clean clusters of 3s, flipped 5s, and flipped 2s without any mixing.
  • Figure 5: Recovering the "Spotted Wing" concept in CUB. We train a post-hoc concept bottleneck model on the CUB dataset. For each image, we use the predicted concept scores as the image's embedding vector (i.e., representation). Here we compare a model using the complete concept representation (${\bm{C}}_A$) with a model representation without the spotted wing concept (${\bm{C}}_{A-S}$). We visualize one of five generated explanations for each model using $\mathtt{RDX}$ and $\mathtt{CNMF}$. We observe that $\mathtt{RDX}$'s explanation focuses on the spotted wing concept. It shows us that only ${\bm{C}}_{A-S}$ mixes images with and without spotted wings. In contrast, the CNMF explanations for each model are both unrelated to the spotted wing concept and indistinguishable from each other, since the representations are highly similar and CNMF discovers nearly the same concepts in both. See \ref{fig:cub_supp_p1} for all five explanations.
  • ...and 16 more figures
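
The "subtraction" intuition from Figure 1 can be illustrated with a toy sketch. This is not the authors' implementation. The rank-based neighborhood distance, the symmetrization by averaging, and the `max_rank` cutoff (a crude stand-in for a locally biased difference function) are all assumptions made for illustration, as are the function names and the toy coordinates. The sketch also stops at listing pairs and omits the spectral clustering step.

```python
import math


def neighborhood_ranks(points):
    """ranks[i][j] is the position of item j when all items are sorted by
    Euclidean distance from item i (0 is item i itself)."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        # Ties are broken so that i comes first, then by index.
        order = sorted(range(n), key=lambda j: (math.dist(points[i], points[j]), j != i, j))
        for rank, j in enumerate(order):
            ranks[i][j] = rank
    return ranks


def symmetric_rank_distance(points):
    """Average the two directed ranks to obtain a symmetric neighborhood distance."""
    ranks = neighborhood_ranks(points)
    n = len(points)
    return [[(ranks[i][j] + ranks[j][i]) / 2 for j in range(n)] for i in range(n)]


def closer_in_first(rep_first, rep_second, max_rank=1):
    """Return (i, j, gap) for pairs that are close neighbors in rep_first
    (distance <= max_rank) and strictly farther apart in rep_second."""
    d_first = symmetric_rank_distance(rep_first)
    d_second = symmetric_rank_distance(rep_second)
    pairs = []
    for i in range(len(rep_first)):
        for j in range(i + 1, len(rep_first)):
            gap = d_second[i][j] - d_first[i][j]
            if d_first[i][j] <= max_rank and gap > 0:
                pairs.append((i, j, gap))
    return pairs


# Items: 0, 1 = circles; 2, 3 = squares; 4, 5 = triangles; 6, 7 = diamonds.
# rep_a groups circles together and squares together.
rep_a = [(0, 0), (1, 0), (50, 0), (51, 0), (0, 100), (1, 100), (50, 100), (51, 100)]
# rep_b pairs each circle with a square instead; triangles and diamonds are unchanged.
rep_b = [(0, 0), (50, 0), (1, 0), (51, 0), (0, 100), (1, 100), (50, 100), (51, 100)]

print(closer_in_first(rep_a, rep_b))  # [(0, 1, 1.5), (2, 3, 1.5)]
print(closer_in_first(rep_b, rep_a))  # [(0, 2, 1.5), (1, 3, 1.5)]
```

Only the circle and square pairs are reported. The triangles and diamonds are grouped identically in both toy representations, so their gap is zero and they are ignored, which mirrors the behavior the caption describes.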