Towards Holistic Surgical Scene Understanding

Natalia Valderrama, Paola Ruiz Puentes, Isabela Hernández, Nicolás Ayobi, Mathilde Verlyk, Jessica Santander, Juan Caicedo, Nicolás Fernández, Pablo Arbeláez

TL;DR

This work tackles holistic surgical scene understanding by proposing a multi-level framework that jointly addresses phase recognition, step recognition, instrument detection, and atomic action recognition in robot-assisted radical prostatectomies. It introduces PSI-AVA, a dataset annotated for both long-term and short-term reasoning, and TAPIR, a transformer-based baseline that fuses frame-level video features with box-level instrument cues for multi-task prediction. Results show that transformer-based architectures outperform DCNN baselines across tasks and demonstrate the value of multi-level annotations, with experiments on other publicly available datasets supporting PSI-AVA as a unified benchmark. The work makes its data, code, and models publicly available to spur future research toward autonomous, context-aware intraoperative understanding and assistance.

Abstract

Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations as it benefits from the learned representation on the instrument detection task to improve its classification capacity. Our experimental results in both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
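To make the multi-level annotation idea concrete, the sketch below shows one way a single annotated keyframe could be represented, with long-term labels (phase, step) attached to the frame and short-term labels (instrument box, atomic actions) attached to each instrument instance. All class names, field names, the box format, and the placeholder labels are hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InstrumentInstance:
    """One instrument in a keyframe (short-term annotation). Hypothetical schema."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2); box format is an assumption
    instrument: str                         # placeholder instrument label
    actions: List[str] = field(default_factory=list)  # atomic actions for this instrument


@dataclass
class KeyframeAnnotation:
    """All annotation levels for one keyframe. Hypothetical schema."""
    video_id: str
    frame_index: int
    phase: str   # long-term label
    step: str    # long-term label
    instruments: List[InstrumentInstance] = field(default_factory=list)


def long_term_labels(frame: KeyframeAnnotation) -> Tuple[str, str]:
    """Return the (phase, step) pair used for long-term reasoning."""
    return (frame.phase, frame.step)


def short_term_labels(frame: KeyframeAnnotation) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return one (instrument, actions) pair per annotated instrument."""
    return [(inst.instrument, tuple(inst.actions)) for inst in frame.instruments]


# Usage with placeholder labels (not real PSI-AVA categories).
example = KeyframeAnnotation(
    video_id="case_000",
    frame_index=120,
    phase="phase_0",
    step="step_0",
    instruments=[
        InstrumentInstance((10.0, 20.0, 110.0, 220.0), "instrument_a", ["action_x"]),
        InstrumentInstance((300.0, 40.0, 420.0, 200.0), "instrument_b", ["action_y", "action_z"]),
    ],
)
```

Keeping both label levels in one record is what allows a single model to be supervised on all four tasks from the same keyframe.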

Paper Structure (9 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: PSI-AVA Dataset enables a holistic analysis of surgical videos through annotations for both long-term (Phase and Step Recognition) and short-term reasoning tasks (Instrument Detection and Atomic Action Recognition).
  • Figure 2: TAPIR. Our approach combines the global temporal information extracted from a surgical video sequence with localized appearance cues. Associating these complementary information sources enables multi-level reasoning over dynamic surgical scenes. Best viewed in color.
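The fusion of global temporal features with localized cues described for Figure 2 can be pictured with a minimal sketch. The version below assumes mean pooling over frame-level features, plain concatenation with each box-level feature, and a linear classification head. These specific choices, along with every function name, are illustrative assumptions and not the paper's confirmed implementation.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average per-frame feature vectors into one clip-level vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def fuse(video_tokens: Sequence[Vector], box_features: Sequence[Vector]) -> List[Vector]:
    """Concatenate the pooled clip feature with each box-level feature.

    Concatenation is an assumed fusion operator, used here only for illustration.
    """
    clip = mean_pool(video_tokens)
    return [clip + list(box) for box in box_features]


def linear_head(x: Vector, weights: Sequence[Vector], bias: Vector) -> Vector:
    """Apply a linear layer: one output score per row of `weights`."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in zip(weights, bias)]


# Toy inputs: two frames with 2-D features, two instrument boxes with 1-D features.
video_tokens = [[1.0, 3.0], [3.0, 5.0]]
box_features = [[0.5], [1.5]]

fused = fuse(video_tokens, box_features)

# Hypothetical head producing two scores per box from the 3-D fused vector.
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
bias = [0.0, 1.0]
scores = [linear_head(f, weights, bias) for f in fused]
```

In this sketch every box shares the same clip-level context, so per-instrument predictions can depend on what happens across the whole sequence as well as on local appearance.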