Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy

Ronald L. P. D. de Jong, Yasmina al Khalil, Tim J. M. Jaspers, Romy C. van Jaarsveld, Gino M. Kuiper, Yiping Li, Richard van Hillegersberg, Jelle P. Ruurda, Marcel Breeuwer, Fons van der Sommen

TL;DR

Robot-assisted minimally invasive esophagectomy (RAMIE) poses substantial navigation challenges due to complex anatomy and occlusions. The authors introduce a large semantic segmentation dataset for RAMIE and benchmark eight real-time models on it and on CholecSeg8k, using two pretraining datasets. They find that ADE20k pretraining outperforms ImageNet pretraining and that attention-based models outperform traditional convolutional networks: SegNeXt and Mask2Former achieve higher Dice scores, and Mask2Former also achieves better boundary accuracy (average symmetric surface distance), albeit with slower inference. The work provides guidance on pretraining strategy and model selection for real-time surgical navigation and surgeon training.

Abstract

Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.
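The Dice score named above measures the overlap between a predicted and a reference segmentation for each class. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code: the function name `dice_per_class`, the flat-list label format, and the choice to return `None` for a class absent from both maps are all assumptions made for illustration.

```python
def dice_per_class(pred, ref, num_classes):
    """Per-class Dice for two label maps given as flat lists of class indices.

    Hypothetical helper for illustration only. Dice_c = 2|P_c ∩ R_c| / (|P_c| + |R_c|).
    Returns None for a class that appears in neither map (score undefined).
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of pixels")

    intersection = [0] * num_classes  # pixels where both maps agree on class c
    pred_count = [0] * num_classes    # pixels predicted as class c
    ref_count = [0] * num_classes     # pixels annotated as class c

    for p, r in zip(pred, ref):
        pred_count[p] += 1
        ref_count[r] += 1
        if p == r:
            intersection[p] += 1

    scores = []
    for c in range(num_classes):
        denom = pred_count[c] + ref_count[c]
        scores.append(2 * intersection[c] / denom if denom else None)
    return scores


# Toy example: 4 pixels, 3 possible classes (class 2 never occurs).
scores = dice_per_class(pred=[0, 0, 1, 1], ref=[0, 1, 1, 1], num_classes=3)
```

Skipping undefined classes rather than scoring them as 0 or 1 is one common convention when averaging over a class-imbalanced dataset; the paper may handle this case differently.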

Paper Structure

This paper contains 14 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Example frames with corresponding overlays of all distinct classes in the RAMIE dataset. One overlay is shown per class, even when the class is visible in multiple images.
  • Figure 2: (Left) Number of annotated images per class. (Right) Proportion of annotated pixels per class as a percentage of the total number of annotated pixels.
  • Figure 3: Dice scores per class for RAMIE (left) and CholecSeg8k (right), averaged across models with attention (Mask2Former, SegFormer, Segmenter, SegNeXt) and without attention (DeepLabv3, DeepLabv3+, PSPNet, FPN).
  • Figure 4: Visualization of input frames, reference annotations, and predictions on the RAMIE dataset using DeepLabv3+, SegNeXt, and Mask2Former, each pretrained on ADE20k.
  • Figure 5: Visualization of input frames, reference annotations, and predictions on the CholecSeg8k dataset using DeepLabv3+, SegNeXt, and Mask2Former, each pretrained on ADE20k.