Benchmarking and Enhancing Surgical Phase Recognition Models for Robotic-Assisted Esophagectomy

Yiping Li, Romy van Jaarsveld, Ronald de Jong, Jasper Bongers, Gino Kuiper, Richard van Hillegersberg, Jelle Ruurda, Marcel Breeuwer, Yasmina Al Khalil

TL;DR

The paper tackles surgical phase recognition in robotic-assisted minimally invasive esophagectomy (RAMIE), introducing a RAMIE-specific dataset of 27 videos and benchmarking four state-of-the-art models. It then presents a novel encoder-decoder model with causal hierarchical attention to better capture complex temporal dynamics, trained via a two-stage pipeline on ResNet50 frame features. The approach improves temporal continuity and phase boundary detection, outperforms the baselines on the RAMIE and AutoLaparo benchmarks, and offers insight into error patterns near phase transitions. This work lays a foundation for more reliable intraoperative guidance and post-hoc analysis in RAMIE, with potential to improve surgical workflow and patient safety, and it points to future work on multi-surgeon validation and clinically oriented metrics.

Abstract

Robotic-assisted minimally invasive esophagectomy (RAMIE) is a recognized treatment for esophageal cancer, offering better patient outcomes compared to open surgery and traditional minimally invasive surgery. RAMIE is highly complex, spanning multiple anatomical areas and involving repetitive phases and non-sequential phase transitions. Our goal is to leverage deep learning for surgical phase recognition in RAMIE to provide intraoperative support to surgeons. To achieve this, we have developed a new surgical phase recognition dataset comprising 27 videos. Using this dataset, we conducted a comparative analysis of state-of-the-art surgical phase recognition models. To more effectively capture the temporal dynamics of this complex procedure, we developed a novel deep learning model featuring an encoder-decoder structure with causal hierarchical attention, which demonstrates superior performance compared to existing models.

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Schematic representation of RAMIE thoracic phases
  • Figure 2: Number of frames per phase in RAMIE dataset
  • Figure 3: Proposed model architecture (left) and comparison of hierarchical attention in causal and non-causal settings (right), adapted from ASFormer (Yi et al., 2021). For each layer $l \in \{1, \ldots, L\}$, the query tensor $Q_l \in \mathbb{R}^{T_l \times d \times h_l}$ and the key tensor $K_l \in \mathbb{R}^{T_l \times d \times 2h_l}$ are defined, where $T_l = \left\lfloor\frac{T_0}{2^{l-1}}\right\rfloor$ is the sequence length, $d$ is the feature dimension, and $h_l = 2^{l-1}$ is the head dimension. A causal mask is applied to ensure that each position can only attend to previous positions in the sequence. The right image illustrates the difference between non-causal (bottom) and causal (top) hierarchical attention resulting from the causal dilated convolution.
  • Figure 4: Mean F1 scores across surgical phases in RAMIE dataset
  • Figure 5: Qualitative results on the RAMIE dataset
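
The tensor shapes given in the Figure 3 caption can be checked with a short sketch. It is not the authors' implementation: the function names `layer_shapes` and `causal_mask` are hypothetical, and the mask assumes that each position may attend to itself as well as to earlier positions, which the caption does not state explicitly.

```python
def layer_shapes(t0, d, num_layers):
    """Per-layer query/key shapes from the Figure 3 caption.

    For layer l (1-indexed): T_l = floor(T_0 / 2**(l - 1)), h_l = 2**(l - 1),
    Q_l has shape (T_l, d, h_l) and K_l has shape (T_l, d, 2 * h_l).
    """
    shapes = []
    for layer in range(1, num_layers + 1):
        h_l = 2 ** (layer - 1)          # head dimension doubles each layer
        t_l = t0 // h_l                 # sequence length halves (floored)
        shapes.append({
            "layer": layer,
            "query": (t_l, d, h_l),
            "key": (t_l, d, 2 * h_l),
        })
    return shapes


def causal_mask(length):
    """Boolean mask: entry [i][j] is True when position i may attend to j.

    Assumption: the current position is included (j <= i).
    """
    return [[j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    for row in layer_shapes(t0=100, d=64, num_layers=4):
        print(row)
    for row in causal_mask(3):
        print(row)
```

With `t0=100` and four layers, the sequence lengths are 100, 50, 25 and 12, showing how the floor in $T_l$ drops the remainder at the deepest layer.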