Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets

Muhammad Abdullah Jamal, Omid Mohareri

TL;DR

The paper addresses semantic segmentation in surgical scenes, using RGB-D data to cope with occlusion and difficult lighting. It introduces SurgDepth, a Vision Transformer-based framework that incorporates depth geometry through a novel 3D awareness fusion block and pairs it with a lightweight ConvNeXt decoder; the depth maps themselves are predicted by state-of-the-art depth estimators. Across five benchmark datasets, SurgDepth achieves state-of-the-art mean IoU, notably $0.862$ on SAR-RARP50, while using fewer parameters than prior methods, which indicates both high accuracy and computational efficiency. The approach supports robust multi-modal segmentation suited to real-time or near-real-time use in operating rooms and related settings.

Abstract

Surgical scene understanding is a key technical component for enabling intelligent and context-aware systems that can transform various aspects of surgical interventions. In this work, we focus on the semantic segmentation task, propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth, and show state-of-the-art (SOTA) results on all publicly available datasets applicable to this task. Previous approaches either fine-tune SOTA segmentation models trained on natural images or encode RGB or RGB-D information using RGB-only pre-trained backbones. In contrast, SurgDepth, which is built on top of Vision Transformers (ViTs), is designed to encode both RGB and depth information through a simple fusion mechanism. We conduct extensive experiments on benchmark datasets including EndoVis2022, AutoLaparo, LapI2I and EndoVis2017 to verify the efficacy of SurgDepth. Specifically, SurgDepth achieves a new SOTA IoU of 0.86 on the EndoVis 2022 SAR-RARP50 challenge and outperforms the current best method by at least 4%, using a shallow and compute-efficient decoder consisting of ConvNeXt blocks.
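Since the headline results are reported as IoU, a small illustration of how a mean IoU score can be computed from a predicted and a ground-truth label map may help. This is a minimal sketch in plain Python, not the evaluation code used by the paper or by any of the benchmarks: the function name `mean_iou`, the toy label maps, and the choice to skip classes that appear in neither map are all assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Average per-class intersection-over-union for two label maps.

    Classes absent from both maps are skipped (an assumption; benchmark
    protocols may treat them differently).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 3x3 label maps with three classes (0, 1, 2).
pred = [[0, 0, 1], [0, 1, 1], [2, 2, 1]]
target = [[0, 0, 1], [0, 0, 1], [2, 2, 2]]
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU: {score:.3f}")
```

In this toy case the per-class IoUs are 3/4, 2/4 and 2/3, so the mean is about 0.639. Benchmark scores such as the 0.86 quoted above are aggregated over whole datasets, and the exact aggregation used by each challenge is not described here.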

Paper Structure (23 sections, 2 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Depth maps predicted by DINOv2+DPT [dinov2] and DepthAnything [depthanything] on SAR-RARP50 examples.
  • Figure 2: Overall architecture of SurgDepth. First, a 3D awareness fusion block encodes the 3D geometric information, and ViT-B then encodes the concatenated RGB-D input. The RGB features are then passed to a shallow decoder head that predicts the segmentation map.
  • Figure 3: 3D awareness fusion block.