SurgFusion-Net: Diversified Adaptive Multimodal Fusion Network for Surgical Skill Assessment

Runlong He, Freweini M. Tesfai, Matthew W. E. Boal, Nazir Sirajudeen, Dimitrios Anastasiou, Jialang Xu, Mobarak I. Hoque, Philip J. Edwards, John D. Kelly, Ashwin Sridhar, Abdolrahim Kadkhodamohammadi, Dhivya Chandrasekaran, Matthew J. Clarkson, Danail Stoyanov, Nader Francis, Evangelos B. Mazomenos

TL;DR

This work introduces SurgFusion-Net and Divergence Regulated Attention (DRA), a fusion strategy for multimodal surgical skill assessment. DRA combines adaptive dual attention with diversity-promoting multi-head attention to fuse information from three modalities (RGB frames, optical flow, and segmentation masks) according to surgical context, improving assessment accuracy and reliability.

Abstract

Robotic-assisted surgery (RAS) is established in clinical practice, and automated surgical skill assessment using multimodal data offers transformative potential for surgical analytics and education. However, developing effective multimodal methods remains challenging because of the task complexity, the limited availability of annotated datasets, and insufficient techniques for cross-modal information fusion. Existing state-of-the-art methods rely exclusively on RGB video and apply only to dry-lab settings. They do not address the significant domain gap between controlled simulation and real clinical cases, where the surgical environment, camera motion, and tissue motion introduce substantial complexity. This work introduces SurgFusion-Net and Divergence Regulated Attention (DRA), a fusion strategy for multimodal surgical skill assessment. We contribute two first-of-their-kind clinical datasets: the RAH-skill dataset, containing 279,691 RGB frames from 37 videos of Robot-Assisted Hysterectomy (RAH), and the RARP-skill dataset, containing 70,661 RGB frames from 33 videos of Robot-Assisted Radical Prostatectomy (RARP). Both datasets include M-GEARS skill annotations together with the corresponding optical flow and tool segmentation masks. DRA combines adaptive dual attention with diversity-promoting multi-head attention to fuse information from three modalities according to surgical context, improving assessment accuracy and reliability. Validated on the JIGSAWS benchmark and on the RAH-skill and RARP-skill datasets, our approach outperforms recent baselines, with SCC improvements of 0.02 under LOSO and 0.04 under LOUO across JIGSAWS tasks, and gains of 0.0538 and 0.0493 on RAH-skill and RARP-skill, respectively.
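
The reported gains are expressed in SCC. The abstract does not expand the acronym; the sketch below assumes it denotes Spearman's rank correlation coefficient between predicted and annotated skill scores, as is common in skill-assessment work. The function names and the score values are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(pred, truth):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rp, rt = average_ranks(pred), average_ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = sqrt(sum((a - mp) ** 2 for a in rp))
    st = sqrt(sum((b - mt) ** 2 for b in rt))
    if sp == 0 or st == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sp * st)


# Hypothetical annotated scores and model predictions for five videos.
truth = [12, 18, 15, 22, 9]
pred = [11.5, 17.0, 18.2, 21.0, 10.1]
print(round(spearman(pred, truth), 4))
```

Because the metric depends only on rank order, a model can score well without matching the absolute score values, so long as it orders the videos the way the annotators did.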

Paper Structure

This paper contains 26 sections, 19 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Examples from the multimodal RAH-skill and RARP-skill datasets, along with the enhanced JIGSAWS dataset. JIGSAWS comprises three tasks: Suturing, Knot Tying, and Needle Passing. All datasets contain RGB frames, optical flow maps, and segmentation masks. In optical flow maps, black background indicates no motion, brightness represents motion speed, and colors indicate motion directions. RAH-skill and RARP-skill include tool segmentation masks, while JIGSAWS contains both tool and reference object masks.
  • Figure 2: SurgFusion-Net: the network consists of three unimodal branches and a multimodal fusion branch. The architecture progressively fuses unimodal features to construct comprehensive multimodal representations for robotic surgical skill assessment.
  • Figure 3: Computation flow of Divergence Regulated Attention. $Q_h$, $K_h$, $V_h$ are query, key, value representations for each head. $\otimes$: matrix multiplication; $\oplus$: element-wise addition; Proj: projection; Sim: similarity.
  • Figure 4: Visualization of the attention weights of the cross-stage fusion block (CSFB) in stage two on the RAH-skill dataset. The top-left panel displays the attention weights across the three modalities over the feature sequence, with four purple vertical dashed lines marking the time windows of specific features. Panels (a)-(d) show RGB frames, optical flow maps, and segmentation masks (from top to bottom in each column) corresponding to time windows T1 to T4. In stage two, each horizontal interval corresponds to about 14 seconds of video.
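
The optical flow colour coding described in the Figure 1 caption (black for no motion, brightness for speed, colour for direction) can be sketched as an HSV mapping. This is a minimal illustration and not the authors' rendering code: the function name `flow_to_rgb`, the `max_speed` normalisation constant, and the specific assignment of hues to directions are assumptions.

```python
import colorsys
from math import atan2, hypot, pi


def flow_to_rgb(dx, dy, max_speed):
    """Map one flow vector (dx, dy) to an 8-bit RGB colour.

    Hue encodes direction, value (brightness) encodes speed, and a zero
    vector maps to black. max_speed is a hypothetical normalisation
    constant: speeds at or above it are drawn at full brightness.
    """
    speed = hypot(dx, dy)
    value = min(speed / max_speed, 1.0)      # brightness grows with speed
    hue = (atan2(dy, dx) / (2 * pi)) % 1.0   # direction angle mapped to [0, 1)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
    return (round(r * 255), round(g * 255), round(b * 255))


# A static pixel is black; a fast pixel moving along +x is fully bright.
print(flow_to_rgb(0.0, 0.0, max_speed=10.0))
print(flow_to_rgb(10.0, 0.0, max_speed=10.0))
```

Applying the function to every pixel of a dense flow field yields a map of the kind shown in the figure, where static background stays dark and moving tools or tissue stand out.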