Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation

Cheng Yuan, Yutong Ban

TL;DR

This paper introduces the Temporal Asymmetric Feature Propagation Network (TAFPNet) to address two problems in surgical scene segmentation: ambiguities in static images and the challenges of dynamic video. It combines a bidirectional attention network with a Temporal Query Propagator (TQP) and an Aggregated Asymmetric Feature Pyramid (AAFP) to propagate temporal information across frames and to keep features for anatomy and instruments distinct. On EndoVis2018 and Endoscapes2023, the method reports gains over prior state-of-the-art methods of +16.4% mIoU and +3.3% mAP, respectively, supporting the effectiveness of temporal propagation and asymmetric attention in fine-grained, occlusion-prone surgical scenes. The approach targets robust, temporally coherent surgical scene understanding, with potential impact on robot-assisted interventions and related medical imaging applications.

Abstract

Surgical scene segmentation is crucial for understanding robot-assisted laparoscopic surgery. Current approaches face two challenges: (i) static-image limitations, including ambiguous local feature similarities and fine-grained structural details, and (ii) dynamic video complexities arising from rapid instrument motion and persistent visual occlusions. Existing methods mainly focus on spatial feature extraction and largely overlook temporal dependencies in surgical video streams. To address this, we present the Temporal Asymmetric Feature Propagation Network (TAFPNet), a bidirectional attention architecture that enables cross-frame feature propagation. The proposed method contains a temporal query propagator that integrates multi-directional consistency constraints to enhance frame-specific feature representation, and an aggregated asymmetric feature pyramid module that preserves discriminative features for anatomical structures and surgical instruments. Our framework enables both temporal guidance and contextual reasoning for surgical scene understanding. Comprehensive evaluations on two public benchmarks show that the proposed method outperforms current SOTA methods by a large margin, with +16.4% mIoU on EndoVis2018 and +3.3% mAP on Endoscapes2023. The code will be made publicly available after paper acceptance.
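For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of one common way to compute mean intersection-over-union from per-pixel class labels. It is an illustration only: the function name `mean_iou`, the flattened-list input format, and the choice to skip classes absent from both prediction and ground truth are assumptions, and the exact evaluation protocol used for EndoVis2018 in the paper may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels (one per pixel).
    num_classes: number of classes (hypothetical parameter for this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both; skipping it is an assumption
        ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# A 2x2 image flattened row by row, with three possible classes.
pred = [0, 1, 1, 1]
target = [0, 1, 0, 1]
score = mean_iou(pred, target, num_classes=3)
```

In this toy example class 0 has IoU 1/2, class 1 has IoU 2/3, and class 2 is skipped because it appears in neither mask, so the score is their average, 7/12.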

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Challenges in surgical scene segmentation: (a) local feature similarity and fine-grained structure complexity in a single image; (b) rapid object motion and inevitable interaction occlusion in video sequences.
  • Figure 2: The overall framework of TAFPNet. It contains (a) a bidirectional attention architecture injected with (b) the Temporal Query Propagator (TQP) and (c) the Aggregated Asymmetric Feature Pyramid (AAFP) module.
  • Figure 3: Visual comparison of segmentation results on (a) EndoVis2018 and (b) Endoscapes2023. For each dataset, from top to bottom, we present three consecutive video frames, their corresponding ground truth, and the segmentation results of BaseNet, AFPNet, and our proposed TAFPNet.