Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets
Muhammad Abdullah Jamal, Omid Mohareri
TL;DR
The paper tackles semantic segmentation in surgical scenes, using RGB-D data to cope with occlusion and difficult lighting. It introduces SurgDepth, a Vision Transformer-based framework that fuses depth geometry with RGB features through a novel 3D awareness fusion block and decodes the result with a lightweight ConvNeXt decoder; the depth maps are predicted by state-of-the-art depth estimators. Across five benchmark datasets, SurgDepth achieves state-of-the-art mean IoU, notably $0.862$ on SAR-RARP50, while using fewer parameters than prior methods, which shows both high accuracy and computational efficiency. The work advances practical surgical scene understanding by enabling robust multi-modal segmentation suitable for real-time or near-real-time use in operating rooms and related settings.
Abstract
Surgical scene understanding is a key technical component for enabling intelligent and context-aware systems that can transform various aspects of surgical interventions. In this work, we focus on the semantic segmentation task, propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth, and show state-of-the-art (SOTA) results on all publicly available datasets applicable to this task. Previous approaches either fine-tune SOTA segmentation models trained on natural images, or encode RGB or RGB-D information using backbones pre-trained on RGB only. In contrast, SurgDepth, which is built on top of Vision Transformers (ViTs), is designed to encode both RGB and depth information through a simple fusion mechanism. We conduct extensive experiments on benchmark datasets including EndoVis2022, AutoLaparo, LapI2I and EndoVis2017 to verify the efficacy of SurgDepth. Specifically, SurgDepth achieves a new SOTA IoU of 0.86 on the EndoVis 2022 SAR-RARP50 challenge and outperforms the current best method by at least 4%, using a shallow and compute-efficient decoder consisting of ConvNeXt blocks.
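The headline results above are reported as IoU (intersection over union) between predicted and ground-truth label maps. As a rough illustration of the metric only, the sketch below computes a mean IoU over classes for a single pair of label maps. The function name `mean_iou`, the choice to skip classes absent from both maps, and the single-image averaging are assumptions made for this example; the evaluation protocol actually used for the reported numbers (for instance per-image versus dataset-level averaging, or ignored classes) is not specified in this text.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Mean IoU over classes present in the prediction or the ground truth.

    `pred` and `target` are 2D label maps of equal shape whose entries are
    class indices in [0, num_classes). This is a hypothetical helper written
    for illustration, not the paper's evaluation code.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel belongs to both the predicted and true region of class p.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel extends the union of the predicted class and the true class.
                union[p] += 1
                union[t] += 1

    # Assumption: classes that appear in neither map are excluded from the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy 2x3 label maps with three classes.
    pred = [[0, 0, 1], [1, 2, 2]]
    target = [[0, 1, 1], [1, 2, 0]]
    print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so their mean is 0.5 up to floating-point rounding.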
