Towards Holistic Surgical Scene Understanding
Natalia Valderrama, Paola Ruiz Puentes, Isabela Hernández, Nicolás Ayobi, Mathilde Verlyk, Jessica Santander, Juan Caicedo, Nicolás Fernández, Pablo Arbeláez
TL;DR
This work tackles holistic surgical scene understanding with a multi-level framework that jointly analyzes phases, steps, instrument detection, and atomic actions in robot-assisted radical prostatectomy videos. It introduces PSI-AVA, a richly annotated dataset that supports both long-term and short-term reasoning, and TAPIR, a transformer-based baseline that fuses frame-level video features with box-level instrument cues for multi-task prediction. Results show that transformer-based architectures outperform deep convolutional neural network (DCNN) baselines across tasks and demonstrate the value of multi-level annotations, with cross-dataset validation supporting PSI-AVA as a unified benchmark. The data, code, and models are publicly available to spur future research toward autonomous, context-aware intraoperative understanding and assistance.
Abstract
Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term reasoning (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations: it benefits from the representation learned on the instrument detection task to improve its classification capacity. Our experimental results on both PSI-AVA and other publicly available datasets demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
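
To make the multi-level idea concrete, the following is a minimal sketch of how frame-level features and box-level instrument features might be combined for multi-task prediction. It is not TAPIR's actual implementation: the class name `MultiLevelHead`, the feature dimensions, the class counts, the random weights, and the choice of simple concatenation followed by linear classifiers are all assumptions made for illustration. In a real system the frame features would come from a video backbone and the box features from an instrument detector.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sigmoid(x):
    """Element-wise logistic function, used for multi-label outputs."""
    return 1.0 / (1.0 + np.exp(-x))


class MultiLevelHead:
    """Hypothetical multi-task head (illustrative only, not the paper's model).

    Long-term tasks (phase, step) are predicted from the frame feature alone.
    The short-term task (atomic actions) is predicted per instrument box from
    the frame feature concatenated with that box's feature.
    """

    def __init__(self, frame_dim, box_dim, n_phases, n_steps, n_actions, rng):
        # Randomly initialized linear classifiers stand in for trained weights.
        self.w_phase = rng.normal(scale=0.02, size=(frame_dim, n_phases))
        self.w_step = rng.normal(scale=0.02, size=(frame_dim, n_steps))
        self.w_action = rng.normal(scale=0.02, size=(frame_dim + box_dim, n_actions))

    def forward(self, frame_feat, box_feats):
        # frame_feat: (frame_dim,), box_feats: (n_boxes, box_dim)
        phase_probs = softmax(frame_feat @ self.w_phase)
        step_probs = softmax(frame_feat @ self.w_step)
        # Repeat the frame feature for every box, then fuse by concatenation.
        tiled = np.broadcast_to(frame_feat, (box_feats.shape[0], frame_feat.shape[0]))
        fused = np.concatenate([tiled, box_feats], axis=1)
        # An instrument can perform several atomic actions at once,
        # so each action gets an independent sigmoid score.
        action_probs = sigmoid(fused @ self.w_action)
        return phase_probs, step_probs, action_probs


rng = np.random.default_rng(0)
# Placeholder sizes, not the dataset's real class counts.
head = MultiLevelHead(frame_dim=8, box_dim=4, n_phases=3, n_steps=5, n_actions=6, rng=rng)
phase, step, action = head.forward(rng.normal(size=8), rng.normal(size=(2, 4)))
print(phase.shape, step.shape, action.shape)  # (3,) (5,) (2, 6)
```

The sketch uses softmax for phases and steps, on the assumption that each frame belongs to exactly one of each, and per-box sigmoid scores for atomic actions, on the assumption that actions are multi-label. Both are modeling choices made here for illustration rather than details stated in the abstract.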
