Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

Oluwatosin Alabi, Meng Wei, Charlie Budd, Tom Vercauteren, Miaojing Shi

TL;DR

This work defines triplet segmentation, a unified task that grounds surgical action triplets in space by linking each instrument instance, through its pixel-level mask, to a verb and an anatomical target. It introduces CholecTriplet-Seg, a large-scale dataset pairing instrument instance masks with verb–target annotations across $30{,}955$ frames and $49{,}866$ grounded triplets from $50$ videos, and proposes Triplet Segmentation mAP as a metric. The proposed TargetFusionNet extends Mask2Former with a target-aware fusion module that incorporates weak anatomy priors from EndoViT, improving target prediction and overall triplet grounding compared to baselines. This approach advances interpretable, fine-grained surgical scene understanding and provides a benchmark and baseline for future spatially grounded analysis of surgical workflows.

Abstract

Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
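
To make the notion of a spatially grounded <instrument, verb, target> output concrete, the following is a minimal sketch of how one such prediction might be represented and compared against an annotation. It is not the authors' evaluation code: the `GroundedTriplet` class, the `mask_iou` and `is_match` helpers, the example class names, and the IoU threshold of 0.5 are all illustrative assumptions, and the exact matching rule behind Triplet Segmentation mAP is not given in this summary.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col) image coordinate


@dataclass(frozen=True)
class GroundedTriplet:
    """Hypothetical container: one instrument instance with its verb and target."""
    instrument: str
    verb: str
    target: str
    mask: FrozenSet[Pixel]  # pixels covered by the instrument instance


def mask_iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    """Intersection-over-union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_match(pred: GroundedTriplet, gt: GroundedTriplet,
             iou_threshold: float = 0.5) -> bool:
    """Assumed rule: all three labels agree AND the masks overlap enough."""
    same_labels = (pred.instrument, pred.verb, pred.target) == (
        gt.instrument, gt.verb, gt.target)
    return same_labels and mask_iou(pred.mask, gt.mask) >= iou_threshold


# Illustrative class names and a toy 2x2 mask.
gt = GroundedTriplet("grasper", "retract", "gallbladder",
                     frozenset({(0, 0), (0, 1), (1, 0), (1, 1)}))
pred_ok = GroundedTriplet("grasper", "retract", "gallbladder",
                          frozenset({(0, 0), (0, 1), (1, 0)}))
pred_wrong_target = GroundedTriplet("grasper", "retract", "liver", gt.mask)

print(mask_iou(pred_ok.mask, gt.mask))   # 3 shared pixels / 4 in the union
print(is_match(pred_ok, gt))             # labels agree, IoU above threshold
print(is_match(pred_wrong_target, gt))   # perfect mask, wrong target
```

Under this assumed rule, a prediction with a perfect mask but the wrong target is rejected, which illustrates why accurate target prediction matters for the overall triplet grounding score.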

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of instance segmentation, triplet recognition, and triplet segmentation. Triplet segmentation grounds $\langle$instrument, verb, target$\rangle$ triplets in space, unifying triplet recognition and instrument localisation for interpretable surgical scene understanding.
  • Figure 2: Overview of the TargetFusionNet architecture (left) and the Transformer Decoder Layer with Target Fusion Module (right). TargetFusionNet augments Mask2Former with an additional encoder that processes weak anatomy logits from EndoViT into multi-scale feature maps. These weak anatomy features serve as auxiliary memory for the transformer decoder. Each transformer decoder layer contains a Target Fusion Module, where instance queries interact with weak anatomy features via gated cross-attention. This allows the model to selectively incorporate coarse anatomical priors while refining triplet predictions.
  • Figure 3: Qualitative comparisons across five frames (a–e). Models shown: RDV-Det, RDV-Det + Mask2Former, Mask2Former-Triplet, and TargetFusionNet. Triplet predictions are shown overlaid on the image. Our method achieves more accurate spatial grounding and better triplet consistency in most cases.
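
The Figure 2 caption describes instance queries interacting with weak anatomy features via gated cross-attention. The sketch below shows one plausible form of that operation in NumPy, under several assumptions: it is single-head, the gate is a sigmoid computed from the queries, and the gated output is added back to the queries as a residual. The function name `gated_cross_attention` and the weight names `w_q`, `w_k`, `w_v`, and `w_gate` are hypothetical; the actual Target Fusion Module may differ in all of these respects.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(queries, memory, w_q, w_k, w_v, w_gate):
    """Single-head gated cross-attention (an assumed form, not the paper's).

    queries: (N, d) instance queries
    memory:  (M, d) flattened weak anatomy features
    Returns: (N, d) queries updated with gated anatomy context.
    """
    q = queries @ w_q
    k = memory @ w_k
    v = memory @ w_v
    # Each query attends over all memory positions.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    attended = attn @ v                                       # (N, d)
    # Sigmoid gate in (0, 1) decides how much anatomy context to admit.
    gate = 1.0 / (1.0 + np.exp(-(queries @ w_gate)))          # (N, d)
    return queries + gate * attended


rng = np.random.default_rng(0)
N, M, d = 4, 6, 8  # toy sizes: 4 queries, 6 memory positions, width 8
queries = rng.normal(size=(N, d))
memory = rng.normal(size=(M, d))
w_q, w_k, w_v, w_gate = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))

fused = gated_cross_attention(queries, memory, w_q, w_k, w_v, w_gate)
print(fused.shape)
```

The gate lets each query scale the attended anatomy signal per channel, which is one way to realise the caption's "selectively incorporate coarse anatomical priors" when those priors are noisy.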