Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Max Kirchner, Alexander C. Jenke, Sebastian Bodenstedt, Fiona R. Kolbinger, Oliver L. Saldanha, Jakob N. Kather, Martin Wagner, Stefanie Speidel

TL;DR

Federated EndoViT investigates privacy-preserving training of surgical foundation models via Federated Learning (FL). The authors adapt Masked Autoencoder pretraining to a federated setting with adaptive FedSAM and server-side SWA, training on Endo700k and evaluating on downstream tasks (Surgical Scene Segmentation, Action Triplet Recognition, Surgical Phase Recognition). They demonstrate that FedSAM improves pretraining reconstruction and that FL-EndoViT attains performance comparable to centralized EndoViT on ATR and SPR, and can exceed centralized performance on SSS when data is limited or high-resolution imagery is used. The work highlights FL as a viable path to robust, privacy-preserving surgical data science models and points to future extensions into video-based FL.

Abstract

Purpose: In this study, we investigate the training of foundation models using federated learning to address data-sharing limitations and enable collaborative model training without data transfer for minimally invasive surgery. Methods: Inspired by the EndoViT study, we adapt the Masked Autoencoder for federated learning, enhancing it with adaptive Sharpness-Aware Minimization (FedSAM) and Stochastic Weight Averaging (SWA). Our model is pretrained on the Endo700k dataset collection and later fine-tuned and evaluated for tasks such as Semantic Segmentation, Action Triplet Recognition, and Surgical Phase Recognition. Results: Our findings demonstrate that integrating adaptive FedSAM into the federated MAE approach improves pretraining, leading to a reduction in reconstruction loss per patch. The application of FL-EndoViT in surgical downstream tasks results in performance comparable to CEN-EndoViT. Furthermore, FL-EndoViT exhibits advantages over CEN-EndoViT in surgical scene segmentation when data is limited and in action triplet recognition when large datasets are used. Conclusion: These findings highlight the potential of federated learning for privacy-preserving training of surgical foundation models, offering a robust and generalizable solution for surgical data science. Effective collaboration requires adapting federated learning methods, such as the integration of FedSAM, which can accommodate the inherent data heterogeneity across institutions. In future, exploring FL in video-based models may enhance these capabilities by incorporating spatiotemporal dynamics crucial for real-world surgical environments.
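The server-side part of the method described above can be illustrated with a minimal sketch. It assumes that client models are combined by a sample-size-weighted average (FedAvg-style) and that SWA keeps an equal-weight running average of the global models across rounds. The abstract does not specify the aggregation rule or the SWA schedule, so both are assumptions here, and the names `fed_avg` and `ServerSWA` are hypothetical. The client-side adaptive FedSAM step is not shown.

```python
from typing import Dict, List

# A model is represented as a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def fed_avg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count.

    Assumption: FedAvg-style weighting; the source does not state the rule used.
    """
    total = sum(client_sizes)
    aggregated: Weights = {}
    for name in client_weights[0]:
        dim = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)
        ]
    return aggregated


class ServerSWA:
    """Equal-weight running average of the global models seen so far."""

    def __init__(self) -> None:
        self.count = 0
        self.avg: Weights = {}

    def update(self, global_weights: Weights) -> Weights:
        self.count += 1
        if self.count == 1:
            # First model: the running average is a copy of that model.
            self.avg = {k: list(v) for k, v in global_weights.items()}
        else:
            # Incremental mean: avg += (x - avg) / count
            for k, v in global_weights.items():
                self.avg[k] = [
                    a + (x - a) / self.count for a, x in zip(self.avg[k], v)
                ]
        return self.avg
```

In a training loop, each communication round would call `fed_avg` on the returned client models and then pass the result to `ServerSWA.update`; the averaged weights would be used as the final encoder once pretraining ends.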

Paper Structure

This paper contains 22 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Federated EndoViT Framework: A two-stage approach for training a surgical Foundation Model using Federated Learning. (I) The model undergoes federated self-supervised pretraining with a Masked Autoencoder strategy, where 75% of the 256 image patches are masked and reconstructed. Adaptive Federated Sharpness-Aware Minimization is applied on the client side, while Stochastic Weight Averaging is used on the server side to enhance generalization across centers. (II) The pretrained EndoViT encoder serves as a Foundation Model, enabling fine-tuning and feature extraction for diverse surgical downstream tasks. This approach builds upon the work of Batić et al. [batic2024endovit].
  • Figure 2: Pretraining Results Over 15 Epochs Showing Number of Patches below Different Reconstruction Loss Thresholds: This figure shows, for each of the 15 pretraining epochs, how many of the 256 patches have a reconstruction loss below the thresholds 0.3, 0.1, 0.05, and 0.01. Each line tracks the patch count for one threshold, together with the SWA (Stochastic Weight Averaging) result and the observed maximum. A higher count of patches below a threshold indicates better reconstruction accuracy. The results shown are from the pretraining of the SSS variant; a similar pattern is observed for ATR and SPR.
  • Figure 3: Performance Comparison of Surgical Scene Segmentation on High-Resolution Images (Fully Fine-Tuned). The violin plots illustrate the distribution of IoU (Intersection over Union) scores for 1,040 test images. The left half of each violin shows the federated variant, and the right half shows the centralized variant. The split violins are color-coded: purple indicates significantly better performance with the federated backbone model, orange indicates significantly better performance with the centralized model, and gray indicates no significant difference, as measured by a Wilcoxon signed-rank test.
  • Figure 4: Performance Comparison of Surgical Scene Segmentation on Low-Resolution Images (Fully Fine-Tuned). The violin plots illustrate the distribution of IoU (Intersection over Union) scores for 1,040 test images. The left half of each violin shows the federated variant, and the right half shows the centralized variant. The split violins are color-coded: purple indicates significantly better performance with the federated backbone model, orange indicates significantly better performance with the centralized model, and gray indicates no significant difference, as measured by a Wilcoxon signed-rank test.
  • Figure 5: Performance Comparison of Action Triplet Recognition (Fully Fine-Tuned). The violin plots show the distribution of average precision scores for nine test videos. The left half of each violin shows the federated variant, and the right half shows the centralized variant. The split violins are color-coded: purple indicates significantly better performance with the federated backbone model, orange indicates significantly better performance with the centralized model, and gray indicates no significant difference, as measured by a Wilcoxon signed-rank test.
  • ...and 8 more figures
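
The masking setup in Figure 1 and the threshold counts in Figure 2 can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes masked patches are chosen uniformly at random (as in the standard Masked Autoencoder recipe) and that a patch counts as "below" a threshold when its loss is strictly smaller. The function names `random_mask` and `patches_below` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_mask(
    num_patches: int = 256, mask_ratio: float = 0.75, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) by uniform random sampling.

    Assumption: uniform random masking; 256 patches and a 75% ratio follow Figure 1.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def patches_below(
    losses: Sequence[float],
    thresholds: Sequence[float] = (0.3, 0.1, 0.05, 0.01),
) -> Dict[float, int]:
    """Count patches whose reconstruction loss is strictly below each threshold."""
    return {t: sum(1 for loss in losses if loss < t) for t in thresholds}


visible, masked = random_mask()
print(len(visible), len(masked))  # 64 192
```

With 256 patches and a 75% mask ratio, the encoder sees 64 patches and the decoder reconstructs the remaining 192. `patches_below` takes the per-patch losses of one image and returns the kind of counts plotted in Figure 2.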