A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification

Jack Breen, Katie Allen, Kieran Zucker, Lucy Godson, Nicolas M. Orsi, Nishant Ravikumar

TL;DR

This work reports the most rigorous single-task validation of histopathology foundation models to date, focusing on ovarian carcinoma morphological subtyping.

Abstract

Large pretrained transformers are increasingly being developed as generalised foundation models that can underpin powerful task-specific artificial intelligence models. Histopathology foundation models show great promise across many tasks, but analyses have typically been limited by arbitrary hyperparameters that were not tuned to the specific task. We report the most rigorous single-task validation of histopathology foundation models to date, specifically in ovarian cancer morphological subtyping. Attention-based multiple instance learning classifiers were compared using three ImageNet-pretrained feature extractors and fourteen histopathology foundation models. The training set consisted of 1864 whole slide images from 434 ovarian carcinoma cases at Leeds Teaching Hospitals NHS Trust. Five-class classification performance was evaluated through five-fold cross-validation, and these cross-validation models were ensembled for hold-out testing and external validation on the Transcanadian Study and OCEAN Challenge datasets. The best-performing model used the H-optimus-0 foundation model, with five-class balanced accuracies of 89%, 97%, and 74% in the test sets. Normalisations and augmentations aided the performance of the ImageNet-pretrained ResNets, but these were still outperformed by 13 of the 14 foundation models. Tuning the hyperparameters of the downstream classifiers improved performance by a median of 1.9% balanced accuracy, with many improvements being statistically significant. Histopathology foundation models offer a clear benefit to ovarian cancer subtyping, improving classification performance to a degree where clinical utility is tangible, albeit with an increased computational burden. Such models could provide a second opinion to histopathologists diagnosing challenging cases and may improve the accuracy, objectivity, and efficiency of pathological diagnoses overall.
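
To make the attention-based multiple instance learning step concrete, the following is a minimal sketch of the attention pooling described by Ilse et al. (2018), written in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: the function name attention_pool, the parameters V and w, and the toy values are all hypothetical, and the abstract does not say whether the plain or gated attention variant was used.

```python
import math


def attention_pool(patch_features, V, w):
    """Pool K patch embeddings into one slide embedding with ABMIL attention.

    patch_features: list of K vectors of length D (one per tissue patch).
    V: L x D matrix and w: length-L vector. Both stand in for learned
       parameters and are hypothetical values in this sketch.
    Returns (attention weights, slide-level embedding).
    """
    # Unnormalised attention score per patch: w^T tanh(V h_k)
    scores = []
    for h in patch_features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(w_j * u_j for w_j, u_j in zip(w, hidden)))

    # Softmax over patches (shifted by the max score for numerical stability)
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Slide embedding: attention-weighted sum of the patch embeddings
    dim = len(patch_features[0])
    slide = [sum(a * h[d] for a, h in zip(attention, patch_features))
             for d in range(dim)]
    return attention, slide


# Toy example: three 2-dimensional patch embeddings (made-up numbers)
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.5], [0.25, 0.75]]
w = [1.0, -1.0]
attention, slide = attention_pool(patches, V, w)
print(attention, slide)
```

In a full pipeline the patch embeddings would come from a frozen feature extractor, and the slide embedding would feed a small classifier over the five subtypes. Those stages are omitted here.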

Paper Structure (7 sections, 16 figures, 19 tables)

Figures (16)

  • Figure 1: Attention-based multiple instance learning (ABMIL; Ilse et al., 2018) model pipeline for ovarian cancer subtyping, showing the classification of a high-grade serous carcinoma (HGSC).
  • Figure 2: Ovarian cancer subtyping results for each feature extractor (mean and 95% confidence interval generated by 10,000 iterations of bootstrapping). Blue indicates ImageNet-pretrained feature extractors, orange indicates histopathology foundation models. Hold-out testing and external validation results are based on an ensemble of cross-validation models. Precise values are tabulated in \ref{app:results}.
  • Figure 3: Confusion matrices for the optimal ABMIL classifier with features from the H-optimus-0 foundation model. Correct classifications are indicated in green.
  • Figure 4: Balanced accuracy results for each ImageNet-pretrained feature extractor (blue), including the seven ResNet50 models with varied preprocessing techniques (green), alongside the three worst-performing foundation models (RN18-Histo, RN50-Histo, and CTransPath) and the best-performing foundation model (H-optimus-0) in (a) cross-validation, (b) hold-out testing, (c) external validation on the Transcanadian Study dataset, (d) external validation on the OCEAN Challenge dataset. For validations (b)-(d), predictions were ensembled from the five cross-validation models. Results are reported as the mean and 95% confidence interval generated by 10,000 iterations of bootstrapping. Precise values and other metric results are tabulated in \ref{app:augmentation}.
  • Figure 5: Balanced accuracy results for each model, comparing the ABMIL classifier trained using the default hyperparameters (pink) with the one trained using the tuned hyperparameters (blue), for (a) cross-validation, (b) hold-out testing, (c) external validation on the Transcanadian Study dataset, (d) external validation on the OCEAN Challenge dataset. For validations (b)-(d), predictions were ensembled from the five cross-validation models. An asterisk (*) indicates a significant difference in the paired t-test at the 5% significance level.
  • ...and 11 more figures
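
Several captions above report balanced accuracy as a mean and 95% confidence interval from 10,000 bootstrap iterations. The sketch below shows one common way to compute such an interval in plain Python. It assumes a percentile bootstrap over slide-level predictions, which the captions do not confirm. The function names, integer class labels, and toy predictions are hypothetical and are not results from the paper.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, taken over the classes present in y_true."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap for balanced accuracy (an assumed method).

    Returns (mean, lower, upper) over the resampled statistics.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iterations):
        # Resample cases with replacement. A class missing from a resample
        # is simply skipped by balanced_accuracy, so no division by zero.
        sample = [rng.randrange(n) for _ in range(n)]
        stats.append(balanced_accuracy([y_true[i] for i in sample],
                                       [y_pred[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iterations)]
    upper = stats[min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))]
    return sum(stats) / n_iterations, lower, upper


# Toy five-class example with made-up labels (0..4 stand in for subtypes)
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 2, 3, 3, 4, 4, 4, 4]
mean, lower, upper = bootstrap_ci(y_true, y_pred, n_iterations=1000)
print(f"balanced accuracy: {mean:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Balanced accuracy weights each class equally regardless of how many slides it contains, which matters when some subtypes are much rarer than others.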