EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos

Nathan Darjana, Ryo Fujii, Hideo Saito, Hiroki Kajita

TL;DR

EgoSurgery-HTS addresses the need for pixel-level understanding in egocentric open-surgery videos by introducing a dataset annotated for tool instance segmentation (14 tools), hand instance segmentation (4 hand categories), and hand–tool interactions. Built on the EgoSurgery platform, the dataset uses SAM-based annotation to generate segmentation masks from bounding boxes, followed by manual corrections, and reports statistics on tool co-occurrence and hand–tool associations. The authors benchmark four mainstream instance segmentation models (Mask R-CNN, QueryInst, Mask2Former, SOLOv2) across the three tasks. Mask2Former and QueryInst perform strongly, particularly on hand and hand–tool segmentation, and models trained on EgoSurgery-HTS outperform models trained on existing datasets such as EgoHands and VISOR-HOS when evaluated on open-surgery footage. The work provides a benchmark for open-surgery scene understanding that could support more accurate action recognition, workflow analysis, and real-time AI-assisted interventions. The authors acknowledge current limitations such as data imbalance and the need for more diverse tools and environments.

Abstract

Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.

Paper Structure

This paper contains 10 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the sparse segmentation annotations from two different videos for the different tasks. (a), (c), (e): Tool and hand instance segmentation tasks. (b), (d), (f): Hand-object segmentation task.
  • Figure 2: Dataset statistics. (a) Distribution of hand and surgical tool instances. (b) Frame-level co-occurrence matrix of tools in the dataset. (c) Distribution of hands and their associated tools. (d) Tool usage counts based on manual handling.
  • Figure 3: Mask2Former confusion matrices for the three tasks. (a) Tool confusion matrix. (b) Hand confusion matrix. (c) Hand-tool confusion matrix.
  • Figure 4: Overview of Mask2Former predictions on the three tasks. (a) Tool segmentation task. (b) Hand segmentation task. (c) Hand-tool segmentation task.
  • Figure 5: Qualitative predictions of QueryInst trained on EgoSurgery-HTS versus trained on EgoHands for the hand segmentation task (left) and versus trained on VISOR-HOS for the hand-tool segmentation task (right).
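
Figure 2(b) refers to a frame-level co-occurrence matrix of tools. The summary does not say how the authors computed it, so the following is only a minimal sketch of one common way to build such a matrix: count, for each pair of classes, the frames in which both appear. The function name `cooccurrence_matrix`, the class names, and the example frames are hypothetical placeholders, not taken from the dataset.

```python
from itertools import combinations


def cooccurrence_matrix(frames, classes):
    """Count, for every pair of classes, the frames in which both appear.

    frames  : list of per-frame label lists (one label per annotated instance)
    classes : ordered list of class names defining the matrix rows and columns
    """
    index = {name: i for i, name in enumerate(classes)}
    n = len(classes)
    matrix = [[0] * n for _ in range(n)]
    for labels in frames:
        # Several instances of one class in a frame count once (frame level).
        present = sorted({index[name] for name in labels})
        for i in present:
            matrix[i][i] += 1  # diagonal: number of frames containing class i
        for i, j in combinations(present, 2):
            matrix[i][j] += 1  # keep the matrix symmetric
            matrix[j][i] += 1
    return matrix


# Hypothetical class names and frames, for illustration only.
classes = ["scissors", "forceps", "needle_holder"]
frames = [
    ["scissors", "forceps"],
    ["forceps", "forceps", "needle_holder"],
    ["forceps"],
]
matrix = cooccurrence_matrix(frames, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

In this sketch the diagonal holds the number of frames in which each class appears, so dividing a row by its diagonal entry would give conditional co-occurrence frequencies. Whether the published figure is normalized this way is not stated in the summary.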