Is Self-Supervision Enough? Benchmarking Foundation Models Against End-to-End Training for Mitotic Figure Classification
Jonathan Ganz, Jonas Ammeling, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville
TL;DR
The paper asks whether foundation-model (FM) embeddings can reduce the need for labeled data in mitotic figure (MF) classification in histopathology. It compares five publicly available ViT-based FMs pretrained with self-supervision against ImageNet-pretrained baselines and an end-to-end-trained ResNet50. The FMs are evaluated by linear probing on MF patches from two public datasets, with varying amounts of training data and with cross-domain splits. The main finding is that the end-to-end ResNet50 consistently outperforms the FM-based classifiers, and that FM embeddings confer neither reliable data efficiency nor domain robustness for this task: none of the FM-based classifiers beats the end-to-end baseline. The work highlights the limits of FM pretraining for specialized histopathology tasks and suggests future work on task-specific fine-tuning or on larger, more diverse histology datasets to make better use of FM representations.
Abstract
Foundation models (FMs), i.e., models trained on a vast amount of typically unlabeled data, have recently become popular and available for the domain of histopathology. The key idea is to extract semantically rich vectors from any input patch, which allows the use of simple subsequent classification networks, potentially reducing the required amount of labeled data and increasing domain robustness. In this work, we investigate to what degree this also holds for mitotic figure classification. Utilizing two popular public mitotic figure datasets, we compared linear probing of five publicly available FMs against models trained on ImageNet and a simple ResNet50 end-to-end-trained baseline. We found that the end-to-end-trained baseline outperformed all FM-based classifiers, regardless of the amount of data provided. Additionally, we did not observe the FM-based classifiers to be more robust against domain shifts, rendering both of the above assumptions incorrect.
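For readers unfamiliar with the term, linear probing means keeping the pretrained encoder frozen and training only a linear classifier on its output embeddings. The following is a minimal sketch of that idea, not the authors' implementation: the embeddings are synthetic stand-ins generated with NumPy (in practice they would come from a frozen FM applied to MF patches), and the function names, embedding dimension, and hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def train_linear_probe(embeddings, labels, lr=0.1, epochs=200, l2=1e-3):
    """Fit a logistic-regression head on frozen embeddings.

    Only the linear weights w and bias b are learned; the encoder that
    produced `embeddings` is never updated. Hyperparameters are assumed.
    """
    n, d = embeddings.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = embeddings @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
        err = probs - labels                       # gradient of BCE w.r.t. logits
        w -= lr * (embeddings.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def predict(embeddings, w, b):
    """Return 1 (mitotic figure) or 0 (non-mitotic) per embedding."""
    return (embeddings @ w + b > 0).astype(int)


# Synthetic stand-in for frozen FM embeddings (assumption: 32-dim vectors,
# two classes separated by a mean shift). Real embeddings are much larger.
rng = np.random.default_rng(0)
n_samples, dim = 400, 32
labels = rng.integers(0, 2, size=n_samples)
class_shift = np.where(labels[:, None] == 1, 0.5, -0.5)
embeddings = class_shift + rng.normal(size=(n_samples, dim))

# Simple train/test split; the probe is fit on the training part only.
train_x, test_x = embeddings[:300], embeddings[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_linear_probe(train_x, train_y)
accuracy = (predict(test_x, w, b) == test_y).mean()
print(f"linear-probe test accuracy on synthetic data: {accuracy:.3f}")
```

Because only `w` and `b` are trained, the quality of the classifier depends entirely on how well the frozen embeddings separate the classes. This is the property the paper tests against an end-to-end-trained network, whose feature extractor adapts to the task.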
