Comparing Computational Pathology Foundation Models using Representational Similarity Analysis

Vaibhav Mishra, William Lotter

TL;DR

This work asks how computational pathology foundation models organize information, beyond downstream task accuracy, by applying Representational Similarity Analysis (RSA) to six models spanning vision-language contrastive and vision-only self-distillation paradigms. Using TCGA H&E patches, the study constructs and compares Representational Dissimilarity Matrices (RDMs) to assess cross-model similarity, slide- and disease-dependence, and intrinsic dimensionality. UNI2 and Virchow2 are the most distinct, while Prov-GigaPath has the highest average similarity to the other models. Slide-specific signals dominate over disease signals, and stain normalization reduces slide-dependence. Vision-language models tend to occupy more compact representational spaces than vision-only models, suggesting different encoding bottlenecks tied to training objectives. The findings inform model ensembling and robustness considerations, indicate that representational structure is shaped by training strategy and data characteristics, and demonstrate an RSA framework that can be extended to foundation models in other medical imaging domains.

Abstract

Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI2, Virchow2, Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-GigaPath has the highest average similarity across models. Sharing a training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed high slide-dependence but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models, by between 5.5% (CONCH) and 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can support their effective development and deployment.
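
The core comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the embeddings are synthetic stand-ins for patch features, the function names `rdm_condensed` and `rdm_similarity` are hypothetical, and scaling each RDM by its maximum distance is only one assumed reading of "normalized to [0, 1]". Each model's RDM holds the pairwise Euclidean distances between its patch embeddings, and two models are compared by the Spearman correlation of their RDM entries.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm_condensed(embeddings: np.ndarray) -> np.ndarray:
    """Upper triangle of the RDM: pairwise Euclidean distances between patches.

    Assumption: distances are divided by the maximum so they lie in [0, 1].
    """
    distances = pdist(embeddings, metric="euclidean")
    max_distance = distances.max()
    return distances / max_distance if max_distance > 0 else distances


def rdm_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Spearman correlation between two models' RDMs over the same patches."""
    rho, _ = spearmanr(rdm_condensed(emb_a), rdm_condensed(emb_b))
    return float(rho)


# Synthetic stand-ins for embeddings of the same 200 patches (not real data).
rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 32))
model_a = patches @ rng.normal(size=(32, 64))

# A rotated copy of model_a: different coordinates, identical geometry.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
model_b = model_a @ rotation

# An unrelated embedding with a different feature dimension.
model_c = rng.normal(size=(200, 48))

sim_rotated = rdm_similarity(model_a, model_b)
sim_unrelated = rdm_similarity(model_a, model_c)
print(f"rotated copy: {sim_rotated:.3f}, unrelated: {sim_unrelated:.3f}")
```

Because an RDM depends only on distances between patches, models with different embedding sizes can be compared directly, and the rank-based Spearman correlation is unaffected by the choice of RDM scaling.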

Paper Structure

This paper contains 22 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Example Representational Dissimilarity Matrices. The RDMs are computed across 10,000 image patches (50 patches from each of 50 WSIs for each of the 4 cancer types) and represent the Euclidean distance between the model representations for each pair of patches, normalized to [0, 1] for each matrix.
  • Figure 2: Spearman correlation between the RDMs of each pair of models. The mean and range across the 5 batches are displayed.
  • Figure 3: Hierarchical clustering (Ward's method) of the Spearman correlation matrix between the model RDMs.
  • Figure 4: Spectral analysis. Singular value decomposition was performed on the foundation model representations. Singular values were normalized to sum to 1 and the cumulative sum is plotted with respect to the percentage of features included. Solid lines indicate vision-only models; dashed lines indicate vision-language models.
  • Figure 5: Spearman correlation between the RDMs of each pair of models using stain-normalized patches. Analogous to Figure 2 except using the stain-normalized patches.
  • ...and 3 more figures
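
The spectral analysis in the Figure 4 caption can be sketched as follows. This is a hedged illustration on synthetic data, not the authors' implementation: the names `cumulative_spectrum` and `pct_features_for` are hypothetical, mean-centering before the SVD is an assumption (exposed as the `center` flag), and "percentage of features" is read here as the share of singular components included.

```python
import numpy as np


def cumulative_spectrum(embeddings: np.ndarray, center: bool = True):
    """Cumulative sum of singular values normalized to sum to 1.

    Returns (percentage of components included, cumulative fraction).
    Assumption: features are mean-centered first when center=True.
    """
    x = embeddings - embeddings.mean(axis=0) if center else embeddings
    singular_values = np.linalg.svd(x, compute_uv=False)
    fraction = np.cumsum(singular_values / singular_values.sum())
    n = len(singular_values)
    percent = 100.0 * np.arange(1, n + 1) / n
    return percent, fraction


def pct_features_for(embeddings: np.ndarray, threshold: float = 0.9) -> float:
    """Smallest percentage of components whose cumulative fraction reaches threshold."""
    percent, fraction = cumulative_spectrum(embeddings)
    index = min(int(np.searchsorted(fraction, threshold)), len(percent) - 1)
    return float(percent[index])


# Synthetic stand-ins (not real model embeddings).
rng = np.random.default_rng(0)
# "Compact": variance concentrated in a 4-dimensional subspace plus small noise.
compact = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 64))
compact = compact + 0.01 * rng.normal(size=(500, 64))
# "Distributed": variance spread evenly over all 64 features.
distributed = rng.normal(size=(500, 64))

print("compact:", pct_features_for(compact))
print("distributed:", pct_features_for(distributed))
```

A curve that rises quickly indicates a compact representation in which a few components carry most of the spectrum, while a slowly rising curve indicates a more distributed one.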