Scaling up self-supervised learning for improved surgical foundation models

Tim J. M. Jaspers; Ronald L. P. D. de Jong; Yiping Li; Carolus H. J. Kusters; Franciscus H. A. Bakker; Romy C. van Jaarsveld; Gino M. Kuiper; Richard van Hillegersberg; Jelle P. Ruurda; Willem M. Brinkman; Josien P. W. Pluim; Peter H. N. de With; Marcel Breeuwer; Yasmina Al Khalil; Fons van der Sommen

TL;DR

This work addresses the limited availability of surgical foundation models by introducing SurgeNetXL, a model pretrained with self-supervised learning (SSL) on more than 4.7 million surgical video frames, including a Surgical YouTube dataset of roughly 2.1 million frames. It benchmarks the model across six downstream datasets and three tasks (semantic segmentation, surgical phase recognition, and critical view of safety (CVS) classification), showing consistent top-tier performance and clear gains over ImageNet-initialized baselines and prior surgical SSL models. The study finds that increasing dataset diversity, extending pretraining time, and using hybrid architectures such as CAFormer yield the strongest gains, with under-represented anatomical structures benefiting most from in-domain SSL. It also offers practical guidance on data composition and architectural choices, highlights the value of the Surgical YouTube dataset, and releases both models and data to support further research and reproducibility. Overall, SurgeNetXL improves generalizability and robustness in data-scarce surgical settings and offers a scalable blueprint for foundation models in other specialized medical domains.

Abstract

Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.

Paper Structure (46 sections, 13 figures, 5 tables)

Introduction
Related work
Surgical computer vision
Medical in-domain self-supervised pretraining
Position of our work
Experimental setup
Self-supervised pretraining
SurgeNetXL: a large-scale unlabeled surgical dataset
Surgical YouTube data
Private datasets
SurgeNetXL variations
Pretraining strategy
Model architectures
Downstream network training
Semantic segmentation
...and 31 more sections

Figures (13)

Figure 1: Radar plot showing ranks across datasets and metrics, with results from the four evaluated open-source foundation models and the proposed SurgeNetXL model. A rank of 1 indicates the best performance, while a rank of 5 indicates the worst performance. Semantic segmentation, phase recognition, and classification are shown in orange, green, and blue, respectively. This color-coding is used throughout the rest of the paper to aid clarity and consistency.
Figure 2: Overview of the experimental setup for this study. The black circles indicate the section numbers in the paper where further details about this specific aspect can be found. The datasets are color-coded for clarity, with the red section highlighting the composition of the SurgeNetXL dataset. The semantic segmentation task utilizes three datasets, phase recognition is performed using two datasets, and CVS classification is based on a single dataset. All ablation studies are focused on the semantic segmentation datasets.
Figure 3: Visual overview of all downstream datasets, including semantic segmentation, phase recognition, and classification in orange, green, and blue, respectively.
Figure 4: Ranking stability across all datasets and metrics. The size of each blob is proportional to the relative frequency with which a model architecture achieves a specific rank. The SurgeNetXL model (and its variations) are color-coded in orange. The median rank for each architecture, rounded to the nearest integer, is indicated by a black cross, while 95% bootstrap intervals (spanning the 2.5th to 97.5th percentiles of the bootstrap distribution) are shown as black vertical lines. Models are ordered from left to right, with the best-performing model on the left and the worst on the right, based on the mean rank score across bootstrap samples.
Figure 5: Ranking stability across each dataset and metric. The size of each blob is proportional to the relative frequency with which a model architecture achieves a specific rank. The SurgeNetXL model (and its variations) are color-coded in orange. The median rank for each architecture is indicated by a black cross, while 95% bootstrap intervals (spanning the 2.5th to 97.5th percentiles of the bootstrap distribution) are shown as black vertical lines. Models are ordered from left to right, with the best-performing model on the left and the worst on the right, based on the mean rank score across bootstrap samples. The plot titles indicate the datasets used: semantic segmentation (orange), phase recognition (green), and CVS classification (blue). The model code names displayed on the x-axis are listed in Table \ref{tab:sota_results}.
...and 8 more figures
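
The ranking-stability captions for Figures 4 and 5 describe a bootstrap over model ranks: rank frequencies, a median rank, a 95% interval from the 2.5th to 97.5th percentiles, and an ordering by mean rank. A minimal sketch of that kind of analysis is shown below. It is not the authors' implementation: the captions do not state what is resampled, so resampling test cases with replacement is an assumption, as are the function name `bootstrap_rank_stability`, the `scores` layout (one row per test case, one column per model, higher is better), and the default of 1000 bootstrap samples.

```python
import numpy as np


def bootstrap_rank_stability(scores, n_boot=1000, seed=0):
    """Bootstrap the ranks of models from per-case scores.

    scores: array of shape (n_cases, n_models); higher means better.
    Assumption: test cases are resampled with replacement and models are
    ranked by their mean score on each resample (rank 1 = best).
    """
    scores = np.asarray(scores, dtype=float)
    n_cases, n_models = scores.shape
    rng = np.random.default_rng(seed)

    ranks = np.empty((n_boot, n_models), dtype=int)
    for b in range(n_boot):
        # Draw a bootstrap sample of test cases.
        idx = rng.integers(0, n_cases, size=n_cases)
        means = scores[idx].mean(axis=0)
        # Best mean score gets rank 1; ties are broken by column order.
        order = np.argsort(-means, kind="stable")
        ranks[b, order] = np.arange(1, n_models + 1)

    mean_rank = ranks.mean(axis=0)
    return {
        # rank_freq[r - 1, m]: fraction of resamples where model m has rank r
        # (this would drive the blob sizes in a ranking-stability plot).
        "rank_freq": np.stack(
            [(ranks == r).mean(axis=0) for r in range(1, n_models + 1)]
        ),
        # Median rank, rounded to the nearest integer.
        "median_rank": np.rint(np.median(ranks, axis=0)).astype(int),
        # 95% bootstrap interval of the rank.
        "ci_low": np.percentile(ranks, 2.5, axis=0),
        "ci_high": np.percentile(ranks, 97.5, axis=0),
        "mean_rank": mean_rank,
        # Model indices from best to worst by mean rank.
        "order": np.argsort(mean_rank, kind="stable"),
    }


if __name__ == "__main__":
    # Synthetic scores for three hypothetical models on 50 test cases.
    demo_rng = np.random.default_rng(42)
    demo = demo_rng.normal(loc=[0.80, 0.78, 0.70], scale=0.05, size=(50, 3))
    result = bootstrap_rank_stability(demo, n_boot=500)
    print("order (best to worst):", result["order"])
    print("median rank:", result["median_rank"])
```

Ordering by mean rank rather than median rank, as the captions describe, separates models that share the same median but differ in how often they reach the top positions.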
