Benchmarking Foundation Models for Mitotic Figure Classification

Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville

TL;DR

This paper addresses mitotic figure classification in pathology under limited labeled data using self-supervised foundation models. It benchmarks multiple pathology foundation models, adapted by either linear probing or LoRA, against end-to-end-trained CNN and Vision Transformer baselines on CCMCT and MIDOG 2022, in both dataset-scaling and cross-domain experiments. LoRA-adapted foundation models outperform their linear-probed counterparts and come close to full-data performance with as little as 10% of the training data, and LoRA adaptation of the most recent models almost closes the performance gap on unseen tumor domains. Fully fine-tuned traditional architectures nevertheless remain competitive. These findings highlight the practicality of parameter-efficient fine-tuning in clinical contexts and motivate broader benchmarking across tasks and datasets to guide deployment of foundation-model-based pathology solutions.

Abstract

The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models. These models can address the limited-data problem by providing semantically rich feature vectors that generalize well to new tasks with minimal training effort, thereby increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws of multiple current foundation models and evaluate their robustness to unseen tumor domains. In addition to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models outperform those adapted with standard linear probing and, with only 10% of the training data, reach performance close to that obtained with 100% of the data. Furthermore, LoRA adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
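
For readers unfamiliar with low-rank adaptation, the update it applies to a frozen weight matrix can be written as below. This is the standard LoRA formulation, not an equation taken from this page: the symbols $W_0$, $A$, $B$, $r$, and $\alpha$ are assumptions for illustration, and the rank, scaling factor, and the specific attention projections adapted in the paper are not stated here.

```latex
\[
  h = W_0 x + \frac{\alpha}{r}\, B A x,
  \qquad W_0 \in \mathbb{R}^{d \times k},\;
  B \in \mathbb{R}^{d \times r},\;
  A \in \mathbb{R}^{r \times k},\;
  r \ll \min(d, k)
\]
```

Only $A$ and $B$ are trained while $W_0$ stays frozen, so the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$. Linear probing, by contrast, leaves the entire backbone frozen and trains only a linear classifier on the extracted feature vectors.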

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: Benchmark study overview. a) Exemplary overview of datasets. Green shows mitotic figures and yellow shows hard negatives. During inference we extract patches of size $224\times224$ around these annotations for evaluation. b) Overview of dataset scaling experiments. c) Schematic overview of the cross-domain experiment. We train a model on each domain separately and evaluate across all domains. d) Overview of evaluated methods.
  • Figure 2: Results of the data scaling experiment on the CCMCT dataset. (*) indicates statistical significance ($\alpha < 0.05$) between the pooled scores of LoRA and LinProb models.
  • Figure 3: Results of the data scaling experiment on the MIDOG 2022 dataset. (*) indicates statistical significance ($\alpha < 0.05$) between the pooled scores of LoRA and LinProb models.
  • Figure 4: Results of the cross-domain experiment. We show each individual scenario with its averaged AUROC score over all training sessions. A: canine mast cell tumor. B: canine lymphoma. C: human breast cancer. D: human neuroendocrine tumor. E: canine lung cancer.
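
The cross-domain results in Figure 4 are reported as AUROC averaged over training sessions. The sketch below illustrates one way such a score could be computed; it is not the authors' evaluation code. The function names `auroc` and `mean_auroc` are hypothetical, and the label convention (1 for a mitotic figure, 0 for a hard negative) is an assumption based on the annotation types named in Figure 1.

```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Equivalent to the normalized Mann-Whitney U statistic; tied scores
    count as half a win. Assumed labels: 1 = mitotic figure,
    0 = hard negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(runs):
    """Average AUROC over several (labels, scores) pairs, one per training run."""
    return mean(auroc(labels, scores) for labels, scores in runs)


if __name__ == "__main__":
    # Toy scores for two hypothetical training runs of one train/test scenario.
    runs = [
        ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
        ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3]),
    ]
    print(mean_auroc(runs))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be the usual choice for full test sets.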