Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Max Kirchner, Hanna Hoffmann, Alexander C. Jenke, Oliver L. Saldanha, Kevin Pfeiffer, Weam Kanjo, Julia Alekseenko, Claas de Boer, Santhi Raj Kolamuri, Lorenzo Mazza, Nicolas Padoy, Sophia Bano, Annika Reinke, Lena Maier-Hein, Danail Stoyanov, Jakob N. Kather, Fiona R. Kolbinger, Sebastian Bodenstedt, Stefanie Speidel

TL;DR

This paper presents FedSurg, the first federated learning benchmark for surgical video classification on appendicitis staging, using a preliminary Appendix300 dataset across four centers to compare unseen-center generalization with center-specific adaptation under privacy constraints. Three teams implemented distinct FL strategies—ViViT-based linear probing, EndoViT-based frame-wise prediction with adaptive optimization, and a Siamese metric-learning approach with robust aggregation—and were evaluated with macro F1-score and Expected Cost, complemented by bootstrapping and Wilcoxon testing. Results reveal limited cross-center generalization, gains from local fine-tuning that are often unstable across centers, and the effectiveness of spatiotemporal modeling and context-aware preprocessing, while also highlighting data imbalance and hyperparameter-tuning challenges in decentralized settings. The study provides the first concrete benchmark and guidance for developing imbalance-aware, adaptive, and robust FL methods in surgical AI, outlining practical directions such as domain adaptation, uncertainty-aware inference, and self-supervised pretraining. FedSurg thus establishes a reference point for future improvements in privacy-preserving, patient-level surgical video analysis.

Abstract

Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
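To make the difference between two of the aggregation schemes named above concrete, here is a minimal sketch of a FedAvg-style and a FedMedian-style server step. It assumes each client's model is flattened to a plain list of floats; the function names, the toy parameter vectors, and the client sizes are illustrative and not taken from any team's submission. FedSAM is not shown.

```python
from statistics import median


def fed_avg(client_weights, client_sizes):
    """FedAvg-style step: average each parameter, weighted by client sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def fed_median(client_weights):
    """FedMedian-style step: take the coordinate-wise median across clients."""
    return [median(column) for column in zip(*client_weights)]


# Hypothetical toy setup: three clients, two parameters each.
# The third client holds the most data and sends an outlying update.
clients = [[0.0, 1.0], [1.0, 1.0], [10.0, 4.0]]
sizes = [10, 10, 20]

avg = fed_avg(clients, sizes)
med = fed_median(clients)
print("FedAvg:   ", avg)
print("FedMedian:", med)
```

In this toy case the weighted mean is pulled toward the large outlying client, while the coordinate-wise median ignores it, which is the usual motivation for median-based aggregation when client updates are heterogeneous.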

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: FedSurg24 Challenge Highlights: The top panel shows example images of intraoperative appendicitis grades, defined according to Gomes et al. (2012), which were used for video annotation. The lower panel illustrates the FedSurg Challenge workflow: teams submitted Docker containers via Synapse, which were executed on a secure cluster simulating FL across three centers with local training and centralized aggregation. Final performance was assessed by testing each center’s best local model on its own test set, while the global model was evaluated on the unseen hold-out center to measure generalization. The challenge timeline with key dates is shown alongside.
  • Figure 2: Label Distribution Across Data Subsets per Center. Label distributions for (a) the training dataset and (b) the test dataset across the four centers. The plots highlight notable inter-center variability and class imbalance. In the training set visualization, the darker segments represent the publicly available subset for participant development, while the lighter segments show the complete dataset used for final federated training. The exact data distribution was not disclosed to the participants.
  • Figure 3: Methods Overview: The three submissions shown utilize different backbone architectures and federated strategies. A common approach is that in each server round, the best-performing model from a client's local training rounds is sent to the server for aggregation. (a) Team Santhi uses a frozen ViViT backbone with a fine-tuned classification head processing 32 frames per video, with updates aggregated via FedAvg. (b) Team Elbflorenz uses a frozen EndoViT backbone with a fine-tuned head, predicting single frames repeatedly and combining them via majority voting, with updates aggregated via FedSAM. (c) Team Camma uses ResNet50 models trained with a contrastive approach on positive and negative pairs, with updates aggregated via FedMedian. At inference, classification is performed by comparing the test embedding to a support set.
  • Figure 4: Confusion Matrices – Task 1, Center 4. Confusion matrices for the participating teams on Center 4 (Task 1). The values in the confusion matrices are not normalized. The color highlighting is normalized row-wise by true labels. The diagonal highlights class-wise recall, while off-diagonal values indicate common misclassification patterns.
  • Figure 5: Bootstrapped Performance Results. Visualization of the performance results with standard deviation as error bars for all teams and tasks after bootstrapping with 10,000 repetitions. The plot illustrates the variability and stability of the outcomes across different centers.
  • ...and 2 more figures
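
The bootstrapped evaluation described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the challenge's evaluation code: the labels and predictions are made up, the helper names are hypothetical, and assigning an F1 of 0 to a class that is absent from a resample is an assumption. The caption reports 10,000 repetitions; the default here is smaller to keep the example fast.

```python
import random
from statistics import mean, stdev


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def bootstrap_macro_f1(y_true, y_pred, labels, n_rep=1000, seed=0):
    """Resample cases with replacement; return mean and std of macro F1."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    return mean(stats), stdev(stats)


# Hypothetical toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
labels = [0, 1, 2]

point_estimate = macro_f1(y_true, y_pred, labels)
boot_mean, boot_std = bootstrap_macro_f1(y_true, y_pred, labels)
print(f"macro F1 = {point_estimate:.3f}, bootstrap = {boot_mean:.3f} +/- {boot_std:.3f}")
```

With only a handful of test cases per center, the bootstrap standard deviation is large, which is consistent with the low ranking stability the abstract reports.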