Low-resource finetuning of foundation models beats state-of-the-art in histopathology

Benedikt Roth, Valentin Koch, Sophia J. Wagner, Julia A. Schnabel, Carsten Marr, Tingying Peng

TL;DR

Histopathology analysis of whole-slide and patch-level data requires robust feature representations. The authors benchmark four vision foundation models and show that finetuning the DINOv2 ViT-S on task-specific colorectal cancer (CRC) data yields performance competitive with or superior to state-of-the-art domain-specific feature extractors, using far less compute on a single GPU. Across slide-level MSI detection (TCGA/CPTAC) and patch-level classification (NCT-CRC), the finetuned ViT-S often outperforms larger models (ViT-g) and matches or surpasses CTransPath and RetCCL, while cutting domain-specific training time by orders of magnitude (to two hours or three days, depending on the dataset). The work broadens access to strong histopathology representations, and the authors release code and finetuned models for reproducible use and further benchmarking.

Abstract

To handle the large scale of whole slide images in computational pathology, most approaches first tessellate the images into smaller patches, extract features from these patches, and finally aggregate the feature vectors with weakly-supervised learning. The performance of this workflow strongly depends on the quality of the extracted features. Recently, foundation models in computer vision showed that leveraging huge amounts of data through supervised or self-supervised learning improves feature quality and generalizability for a variety of tasks. In this study, we benchmark the most popular vision foundation models as feature extractors for histopathology data. We evaluate the models in two settings: slide-level classification and patch-level classification. We show that foundation models are a strong baseline. Our experiments demonstrate that by finetuning a foundation model on a single GPU for only two hours or three days, depending on the dataset, we can match or outperform state-of-the-art feature extractors for computational pathology. These findings imply that even with limited resources one can finetune a feature extractor tailored towards a specific downstream task and dataset. This is a considerable shift from the current state, where only a few institutions with large amounts of resources and datasets are able to train a feature extractor. We publish all code used for training and evaluation as well as the finetuned models.
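
The three-stage workflow described in the abstract (tessellate, extract, aggregate) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `tessellate`, `mean_pool`, `slide_embedding`, and `toy_extract` are hypothetical helpers, the toy extractor stands in for a real feature extractor such as a finetuned ViT, and plain mean pooling is a deliberately simple substitute for the weakly-supervised aggregation model, which this summary does not specify.

```python
from statistics import fmean
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[int, int]  # (x, y) of a patch's top-left corner, in pixels


def tessellate(width: int, height: int, patch: int) -> List[Coord]:
    """Top-left corners of non-overlapping patch x patch tiles that fit fully
    inside a width x height slide (partial tiles at the border are dropped)."""
    return [
        (x, y)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]


def mean_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate per-patch feature vectors into one slide-level vector.
    Mean pooling is an assumed stand-in for a learned aggregator."""
    return [fmean(column) for column in zip(*features)]


def slide_embedding(
    width: int,
    height: int,
    patch: int,
    extract: Callable[[Coord], Sequence[float]],
) -> List[float]:
    """Tessellate the slide, extract one feature vector per patch, aggregate."""
    coords = tessellate(width, height, patch)
    return mean_pool([extract(c) for c in coords])


def toy_extract(coord: Coord) -> List[float]:
    """Hypothetical extractor: returns the patch position as a 2-d 'feature'.
    A real pipeline would read the patch pixels and run them through a network."""
    x, y = coord
    return [float(x), float(y)]


coords = tessellate(1024, 512, 256)
embedding = slide_embedding(1024, 512, 256, toy_extract)
print(len(coords), embedding)
```

Working with patch coordinates rather than pixel arrays mirrors how large slides are usually handled, since patches can then be read lazily. The quality of `extract` is the component the paper studies; the slide-level classifier would be trained on top of the aggregated vectors.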

Paper Structure

This paper contains 8 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: We propose finetuning a DINOv2 ViT-S, which yields performance at least equal to that of CTransPath and RetCCL in a fraction of the domain-specific training time. Performance is measured on three datasets: TCGA and CPTAC (WSI-level classification) and NCT-CRC (patch-level classification).
  • Figure 2: Performance over time when finetuning a ViT-S with DINOv2: a) training on NCT-CRC and evaluating on the external NCT-CRC test set (patch-level classification), and b) training on TCGA and testing on TCGA (5-fold cross-validation) and CPTAC (external test set) (WSI-level classification).