SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

Zheng Fang; Ziwei Niu; Ziyue Wang; Zhu Zhuo; Haofeng Liu; Shuyang Qian; Jun Xia; Yueming Jin

TL;DR

SurgFed is a multi-task federated learning framework that enables federated learning for surgical scene segmentation and depth estimation across diverse surgical types. Extensive experiments on five public datasets covering four surgical types show that it improves over state-of-the-art methods.

Abstract

Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, which limits their effectiveness in heterogeneous clinical environments and leads to poor local predictions. (2) Task Diversity: Server-side aggregation that relies solely on gradient-based clustering often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework that enables federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two designs, Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), which together address the challenge of fully exploiting cross-site and cross-task information. Technically, LCS is a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, helping the local model learn site-specific embeddings. We further introduce LHA, which employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over state-of-the-art methods on five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
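
The two language-guided ideas in the abstract can be illustrated with a minimal pure-Python sketch. It is an assumption-laden toy, not the authors' implementation: the abstract does not specify the LCS network or the LHA hypernetwork, so the sketch assumes a single linear layer with sigmoid gating for channel selection, and it replaces the hypernetwork with a plain attention-weighted average of site parameters. All function names, shapes, and values below are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(text_emb, weight, bias):
    """Map a text embedding to one gate in (0, 1) per feature channel.

    Assumption: a single linear layer followed by a sigmoid stands in
    for the (unspecified) channel selection network.
    """
    return [
        sigmoid(sum(w * t for w, t in zip(row, text_emb)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_gates(features, gates):
    """Scale each channel (a flat list of activations) by its gate."""
    return [[g * v for v in channel] for channel, g in zip(features, gates)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(site_embs):
    """Row i holds site i's attention over all sites (scaled dot product)."""
    scale = math.sqrt(len(site_embs[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in site_embs])
        for q in site_embs
    ]


def personalized_update(site_params, weights_row):
    """Attention-weighted mix of every site's parameters for one layer.

    Simplification: this direct average replaces the hypernetwork
    mentioned in the abstract.
    """
    dim = len(site_params[0])
    return [
        sum(w * p[j] for w, p in zip(weights_row, site_params))
        for j in range(dim)
    ]


# Hypothetical toy inputs: a 2-d text embedding gating 3 channels.
text_emb = [1.0, 0.0]
weight = [[4.0, 0.0], [-4.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
gates = channel_gates(text_emb, weight, bias)
gated = apply_gates([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], gates)

# Hypothetical toy inputs: three sites, the first two with identical text.
site_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
site_params = [[1.0], [3.0], [10.0]]
attn = attention_weights(site_embs)
update0 = personalized_update(site_params, attn[0])
```

In this toy setup, the text embedding boosts the first channel and suppresses the second, while sites with similar text descriptions attend more strongly to each other when their parameters are mixed. Doing the mixing once per layer would correspond to the layer-wise aspect described in the abstract.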

Paper Structure (11 sections, 12 equations, 4 figures, 4 tables)

Introduction
Related Work
Methodology
Overall Pipeline
Language-guided Channel Selection (LCS)
Language-guided Hyper Aggregation (LHA)
Experiments
Experiment Setting
Experimental Results and Empirical Analysis
Ablation Study
Conclusions

Figures (4)

Figure 1: Comparison of different federated learning paradigms. Traditional FL fails to handle diverse surgical tasks, while Multi-Task FL supports multiple objectives but lacks domain-specific guidance. Our proposed SurgFed incorporates both multi-task capability and surgical prior, enabling personalization across different surgical scenarios.
Figure 2: In the local-side stage, LCS allows each local model to adapt to local data through personalized selection and enhancement of feature-specific channels. In the server-side stage, LHA models the task interactions across different sites, enabling a personalized update to each local site model.
Figure 3: Impact on Fine-Tuning Dec and Mem Layers of SAM2.
Figure 4: Segmentation and depth estimation visualization results across five sites.
