GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak, Tavaheed Tariq, Sonia Yadav, Moloud Abdar, Janibul Bashir

TL;DR

GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark, enables systematic evaluation of vision-language models in clinically realistic multi-instrument scenes.

Abstract

Clinically reliable perception of surgical scenes is essential for advancing intelligent, context-aware intraoperative assistance such as instrument handoff guidance, collision avoidance, and workflow-aware robotic support. Existing surgical tool benchmarks primarily evaluate category-level segmentation, requiring models to detect all instances of predefined instrument classes. However, real-world clinical decisions often require resolving references to a specific instrument instance based on its functional role, spatial relation, or anatomical interaction, capabilities not captured by current evaluation paradigms. We introduce GroundedSurg, the first language-conditioned, instance-level surgical grounding benchmark. Each instance pairs a surgical image with a natural-language description targeting a single instrument, accompanied by structured spatial grounding annotations including bounding boxes and point-level anchors. The dataset spans ophthalmic, laparoscopic, robotic, and open procedures, encompassing diverse instrument types, imaging conditions, and operative complexities. By jointly evaluating linguistic reference resolution and pixel-level localization, GroundedSurg enables systematic evaluation of vision-language models (VLMs) in clinically realistic multi-instrument scenes. Extensive experiments demonstrate substantial performance gaps across modern segmentation models and VLMs, highlighting the urgent need for clinically grounded vision-language reasoning in surgical AI systems. Code and data are publicly available at https://github.com/gaash-lab/GroundedSurg

Paper Structure

This paper contains 4 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Overview of GroundedSurg. (a) Existing datasets focus on category-level segmentation without language conditioning or instance-level grounding. (b) GroundedSurg introduces natural-language queries with structured spatial annotations for query-conditioned instrument localization. (c) Baseline results reveal substantial performance gaps, highlighting the challenges of grounded surgical perception.
  • Figure 2: GroundedSurg benchmark pipeline. Surgical images are paired with initial prompts and processed by a vision–language model to generate structured instrument descriptions. All 1071 queries are human- and clinician-verified for semantic correctness and ambiguity removal. Final annotations are stored in a standardized JSON schema with spatial grounding (bounding box and center point) and segmentation masks.
  • Figure 3: Qualitative comparison on GroundedSurg showing that reasoning-oriented models produce more spatially precise masks than general-purpose models when projecting structured localization outputs onto a frozen segmentation backend, particularly in multi-instrument and visually cluttered scenes.
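
The Figure 2 caption mentions a JSON schema carrying a bounding box, a center point, and a segmentation mask for each query. The sketch below shows how such a record might be parsed and how a predicted mask could be scored against the ground truth. It is an illustration under stated assumptions, not the released format or the paper's evaluation protocol: the field names (`image`, `query`, `bbox`, `point`), the `(x_min, y_min, x_max, y_max)` box convention, the sample file name, and the choice of intersection-over-union (IoU) as the mask metric are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundingRecord:
    # All field names are hypothetical; the released schema may differ.
    image: str                        # path or identifier of the surgical frame
    query: str                        # natural-language description of one instrument
    bbox: Tuple[int, int, int, int]   # assumed (x_min, y_min, x_max, y_max) in pixels
    point: Tuple[int, int]            # assumed (x, y) anchor on the instrument


def parse_record(raw: str) -> GroundingRecord:
    """Parse one JSON annotation string into a typed record."""
    d = json.loads(raw)
    return GroundingRecord(d["image"], d["query"], tuple(d["bbox"]), tuple(d["point"]))


def bbox_center(bbox: Tuple[int, int, int, int]) -> Tuple[float, float]:
    """Return the geometric center of a box given as corner coordinates."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def mask_iou(pred: List[List[int]], gt: List[List[int]]) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Both masks empty: treated here as perfect agreement (a convention, not from the paper).
    return inter / union if union else 1.0


# Illustrative record; the values are made up for the example.
raw = (
    '{"image": "frame_0001.png", '
    '"query": "the forceps grasping tissue on the left", '
    '"bbox": [10, 20, 50, 80], "point": [30, 50]}'
)
rec = parse_record(raw)
center = bbox_center(rec.bbox)

# Toy 2x3 masks: two overlapping pixels out of four in the union.
pred = [[0, 1, 1], [0, 1, 0]]
gt = [[0, 1, 0], [0, 1, 1]]
score = mask_iou(pred, gt)
```

Keeping the language query, the spatial anchors, and the mask score in one typed record reflects the benchmark's stated aim of jointly evaluating reference resolution and pixel-level localization: a prediction only scores well if the model picked the instrument the query refers to.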