Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach
Oluwatosin Alabi, Meng Wei, Charlie Budd, Tom Vercauteren, Miaojing Shi
TL;DR
This work defines triplet segmentation, a unified task that grounds surgical action triplets in space by linking instrument instances to verb targets via pixel-level masks. It introduces CholecTriplet-Seg, a large-scale dataset pairing instrument instance masks with verb–target annotations across 30,955 frames and 49,866 grounded triplets from 50 videos, and proposes Triplet Segmentation mAP as a metric. The proposed TargetFusionNet extends Mask2Former with a target-aware fusion module that incorporates weak anatomy priors from EndoViT, significantly improving target prediction and overall triplet grounding compared to baselines. This approach advances interpretable, fine-grained surgical scene understanding and establishes a solid foundation for future spatially grounded analysis in surgical workflows.
Abstract
Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods learn only from frame-level classification and therefore fail to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have relied primarily on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task that produces spatially grounded <instrument, verb, target> outputs. We first present CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames that links instrument instance masks with action verb and anatomical target annotations, establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism, which addresses the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
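To make the notion of a spatially grounded <instrument, verb, target> output concrete, the following is a minimal Python sketch of one plausible way to represent a grounded triplet and to decide whether a prediction matches a ground-truth annotation. Everything in it is an assumption for illustration and is not taken from the paper: the `GroundedTriplet` class, the `mask_iou` and `is_match` helpers, the representation of masks as sets of pixel coordinates, the 0.5 IoU threshold, and the example labels are all hypothetical, and the actual Triplet Segmentation mAP may define matching differently.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate of one mask pixel


@dataclass(frozen=True)
class GroundedTriplet:
    """Hypothetical record: one instrument instance with its verb and target."""
    instrument: str
    verb: str
    target: str
    mask: FrozenSet[Pixel]  # pixels covered by the instrument instance


def mask_iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_match(pred: GroundedTriplet, gt: GroundedTriplet,
             iou_threshold: float = 0.5) -> bool:
    """Assumed criterion: all three labels agree and the masks overlap enough."""
    same_labels = (pred.instrument, pred.verb, pred.target) == (
        gt.instrument, gt.verb, gt.target)
    return same_labels and mask_iou(pred.mask, gt.mask) >= iou_threshold


def rect(r0: int, c0: int, r1: int, c1: int) -> FrozenSet[Pixel]:
    """Helper building a rectangular mask over rows [r0, r1) and cols [c0, c1)."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# Illustrative labels only; masks are tiny 4x4 rectangles.
gt = GroundedTriplet("grasper", "retract", "gallbladder", rect(0, 0, 4, 4))
pred = GroundedTriplet("grasper", "retract", "gallbladder", rect(0, 1, 4, 5))
wrong_target = GroundedTriplet("grasper", "retract", "liver", rect(0, 0, 4, 4))

print(is_match(pred, gt))          # labels agree, masks overlap
print(is_match(wrong_target, gt))  # perfect mask, but wrong target label
```

Under this assumed criterion, a prediction with a perfect instrument mask still fails when its target label is wrong, which illustrates why accurate anatomical target prediction matters for the overall triplet grounding score.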
