Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective

Shengjia Chen, Gabriele Campanella, Abdulkadir Elmas, Aryeh Stock, Jennifer Zeng, Alexandros D. Polydorides, Adam J. Schoenfeld, Kuan-lin Huang, Jane Houldsworth, Chad Vanderbilt, Thomas J. Fuchs

TL;DR

This study addresses the gap between public-domain WSI benchmarks and clinical practice by systematically evaluating ten slide-level embedding aggregation methods across nine clinically relevant tasks. It compares embeddings from domain-specific histology foundation models (FMs) against embeddings from ImageNet-trained models and analyzes how spatial information affects the different aggregators. Key findings: domain-specific FMs generally outperform generic ones, spatial-aware methods yield limited or task-dependent gains, and no single method excels across all tasks. The work provides practical guidelines and an open benchmarking pipeline to advance clinically applicable aggregation techniques in computational pathology.

Abstract

Recent advances in artificial intelligence (AI), in particular self-supervised learning of foundation models (FMs), are revolutionizing medical imaging and computational pathology (CPath). A persistent challenge in the analysis of digital Whole Slide Images (WSIs) is aggregating tens of thousands of tile-level image embeddings into a slide-level representation. Because datasets created for genomic research, such as TCGA, are so widely used for method development, the performance of these techniques on diagnostic slides from clinical practice has been inadequately explored. This study conducts a thorough benchmarking analysis of ten slide-level aggregation techniques across nine clinically relevant tasks, including diagnostic assessment, biomarker classification, and outcome prediction. The results yield the following key insights: (1) Embeddings derived from domain-specific (histological images) FMs outperform those from generic ImageNet-based models across aggregation methods. (2) Spatial-aware aggregators enhance performance significantly when using ImageNet pre-trained models but not when using FMs. (3) No single model excels in all tasks, and spatially aware models do not show the general superiority that might be expected. These findings underscore the need for more adaptable and universally applicable aggregation techniques, guiding future research towards tools that better meet the evolving needs of clinical AI in pathology. The code used in this work is available at https://github.com/fuchs-lab-public/CPath_SABenchmark.
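To make the aggregation problem concrete, the following is a minimal NumPy sketch of attention-based pooling, the general idea behind an attention-based MIL aggregator such as the AB-MIL baseline named in the figure captions. It is not the benchmark's implementation: the function name `attention_pool`, the parameters `V` and `w`, and all shapes are illustrative assumptions, and the parameters are random here rather than learned.

```python
import numpy as np

def attention_pool(tiles, V, w):
    """Aggregate tile embeddings (n_tiles, d) into one slide embedding (d,).

    V (d, h) and w (h,) are hypothetical attention parameters; in a real
    model they would be learned jointly with a slide-level classifier.
    """
    scores = np.tanh(tiles @ V) @ w      # one scalar score per tile
    scores = scores - scores.max()       # stabilise the softmax numerically
    weights = np.exp(scores)
    weights = weights / weights.sum()    # attention weights sum to 1
    slide = weights @ tiles              # weighted average of tile embeddings
    return slide, weights

# Synthetic stand-in for the tile embeddings of one slide (assumed sizes).
rng = np.random.default_rng(0)
tiles = rng.normal(size=(1000, 64))
V = rng.normal(size=(64, 16)) * 0.1
w = rng.normal(size=16)

slide_embedding, attn = attention_pool(tiles, V, w)
```

With `w` set to all zeros every tile receives the same weight, so the sketch reduces to plain mean pooling. Note that it uses no tile coordinates, which makes it an example of an aggregator that ignores spatial information.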

Paper Structure (13 sections, 3 figures, 3 tables)

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Evolution of Slide Aggregation Methods in CPath (2017 - 2023). We track the progression of aggregation and embedding techniques, categorized by Key Instance, Attention, Cluster, Self-Attention, and Graph-based methods. Models benchmarked in this study are marked with a black outline. Colors and gradient colors denote method categories and their combinations, respectively; vertical placement shows chronological order, and horizontal lines indicate whether spatial information is integrated or not.
  • Figure 2: Boxplots of AUC scores for the benchmarked aggregation methods versus the AB-MIL baseline across nine datasets, using two embedding groups. Scores are from 20 Monte Carlo cross-validations, averaged over two random seeds. A one-sided t-test compared each method against AB-MIL, with symbols indicating significant differences. The dotted orange line shows the AB-MIL average for reference. Methods follow the category order and colors of Figure 1.
  • Figure 3: A: Computational resources vs. tiles per slide; B: Histogram of the number of tiles per slide in each dataset; C: Validation AUC during training for BCa ER. The line is the mean validation AUC, and the error bars show the standard error over 20 MCCV runs.
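The captions above refer to 20 Monte Carlo cross-validation (MCCV) runs and to error bars based on the standard error. A minimal sketch of both ideas follows, under stated assumptions: the helper names `mccv_splits` and `mean_and_sem`, the 20% validation fraction, and the slide count are hypothetical, and the AUC values are random placeholders rather than results from the paper.

```python
import numpy as np

def mccv_splits(n_slides, n_runs=20, val_fraction=0.2, seed=0):
    """Yield (train_idx, val_idx) pairs; each run is an independent random split.

    val_fraction is an assumed value, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_val = max(1, int(round(val_fraction * n_slides)))
    for _ in range(n_runs):
        perm = rng.permutation(n_slides)
        yield perm[n_val:], perm[:n_val]

def mean_and_sem(values):
    """Return the mean and the standard error of the mean of per-run scores."""
    values = np.asarray(values, dtype=float)
    sem = values.std(ddof=1) / np.sqrt(len(values))
    return values.mean(), sem

splits = list(mccv_splits(n_slides=100))

# Placeholder per-run AUCs; a real pipeline would train and evaluate per split.
placeholder_aucs = np.random.default_rng(1).uniform(0.7, 0.9, size=len(splits))
mean_auc, sem_auc = mean_and_sem(placeholder_aucs)
```

Unlike k-fold cross-validation, MCCV draws each split independently, so a slide may appear in the validation set of several runs.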