Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries

Yang Ding, Can Han, Sijia Du, Yaqi Wang, Dahong Qian

TL;DR

This work addresses real-time depth estimation for endoscopic stereo matching under fuzzy tissue boundaries by introducing RRESM, a lightweight framework that couples a MobileNetV4-based feature extractor with a 3D Mamba Coordinate Attention (MCA) module for efficient cost aggregation and a High-Frequency Disparity Optimization (HFDO) module for boundary refinement. The 3D MCA module captures axis-specific long-range dependencies along the disparity, height, and width axes, while HFDO uses Haar wavelets to emphasize high-frequency boundary structures, yielding sharp disparity maps without heavy computation. Evaluations on SCARED and SERV-CT demonstrate state-of-the-art accuracy at real-time speed (≈42 FPS) on high-resolution inputs, with ablations confirming the complementary benefits of MCA and HFDO. These results suggest practical impact for robotic minimally invasive surgery (MIS): accurate depth in challenging boundary regions at clinically viable frame rates. The work also outlines future directions in uncertainty handling and multi-view extension.

Abstract

Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose RRESM, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.
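To make the idea of emphasizing high-frequency detail in the wavelet domain concrete, the following is a minimal sketch, not the authors' HFDO implementation. It performs a single-level 2D Haar decomposition of a small feature map, scales the low-frequency (LL) band by a factor w, and reconstructs the map. The function names, the use of nested Python lists, and the treatment of w as a fixed scalar are all assumptions made for illustration; how RRESM applies or learns this weight is not specified here.

```python
def _zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns the LL, LH, HL, HH sub-bands. Sub-band naming conventions vary
    between libraries; here LH responds to horizontal intensity changes and
    HL to vertical ones.
    """
    rows, cols = len(x) // 2, len(x[0]) // 2
    ll, lh, hl, hh = (_zeros(rows, cols) for _ in range(4))
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i][j] = (a - b + c - d) / 2  # left-right difference
            hl[i][j] = (a + b - c - d) / 2  # top-bottom difference
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution 2D list."""
    rows, cols = len(ll), len(ll[0])
    x = _zeros(2 * rows, 2 * cols)
    for i in range(rows):
        for j in range(cols):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + p + q + r) / 2
            x[2 * i][2 * j + 1] = (s - p + q - r) / 2
            x[2 * i + 1][2 * j] = (s + p - q - r) / 2
            x[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2
    return x


def attenuate_low_frequency(x, w):
    """Scale the LL band by w (hypothetical fixed scalar) and reconstruct."""
    ll, lh, hl, hh = haar_dwt2(x)
    ll = [[w * v for v in row] for row in ll]
    return haar_idwt2(ll, lh, hl, hh)
```

With w = 1 the input is reconstructed unchanged. With w = 0 on the 2x2 map [[1, 3], [1, 3]], the result is [[-1, 1], [-1, 1]]: the mean level is removed while the step of 2 across the vertical edge is preserved, which is the sense in which attenuating LL makes boundary structure relatively more prominent.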

Paper Structure

This paper contains 17 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Depth error maps near fuzzy boundaries on the SCARED dataset. Brighter regions indicate larger depth errors. Our method performs well in estimating depth around fuzzy tissue boundaries, outperforming the state-of-the-art natural image methods (e.g., GwcNet and IGEV).
  • Figure 2: Overall architecture of RRESM. The Feature Net adopts a U-Net-like structure, with a frozen encoder based on MobileNetV4 and a trainable decoder. A correlation cost volume is constructed using group-wise correlation. The MCA module is embedded within a simplified 3D CNN to enhance cost aggregation. The HFDO module takes deep features from the encoder as contextual information, applies a wavelet-based filter, and enhances high-frequency components in the disparity map. Details of the MCA and HFDO modules are shown in Figs. 3 and 4, respectively.
  • Figure 3: (a) Architecture of the MCA module. Attention is computed independently along the $H$, $W$, and $D$ dimensions, concatenated along the channel axis, and passed through a Bi-Mamba2 layer. (b) Implementation of the Bi-Mamba2 layer in MCA. A bidirectional scan is performed over the concatenated axis-pooled features in both forward and backward directions. (c) Re-weighting operation. The resulting position-sensitive attention maps assign a unique weight to each coordinate in the cost volume.
  • Figure 4: Wavelet Transform Refine module. The context feature map is decomposed into low-frequency (LL) and high-frequency (LH, HL, HH) components. The LL components are attenuated by a parameter $w$.
  • Figure 5: Visualization of disparity estimation on the SCARED and SERV-CT datasets. (a) and (b) are from SCARED, while (c) and (d) are from SERV-CT. In cases with limited depth variation, such as (b), most methods perform similarly. However, in high-frequency regions like surgical tool-tissue boundaries in (a), RRESM yields more accurate depth predictions. On SERV-CT, our method also delivers competitive results. Note: MAE is measured in millimeters (mm) for SCARED (with ground-truth depth) and in pixels for SERV-CT (with ground-truth disparity).
  • ...and 1 more figure
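The Figure 2 caption mentions a cost volume built with group-wise correlation. As a rough illustration, the sketch below follows the commonly used definition of that operation: the feature channels are split into groups, and for each group and candidate disparity d, the left feature at column x is correlated with the right feature at column x - d, averaged over the channels in the group. The function name, the [C][H][W] nested-list layout, and the parameter names are assumptions for illustration. The group count, feature resolution, and tensor layout used by RRESM are not stated here.

```python
def groupwise_correlation_volume(left, right, max_disp, num_groups):
    """Build a [G][D][H][W] cost volume from [C][H][W] left/right features.

    Entry [g][d][y][x] is the mean product, over the channels of group g,
    of left[c][y][x] and right[c][y][x - d]. Positions where x - d falls
    outside the right image are left at 0.
    """
    channels, height, width = len(left), len(left[0]), len(left[0][0])
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = channels // num_groups

    volume = [[[[0.0] * width for _ in range(height)]
               for _ in range(max_disp)]
              for _ in range(num_groups)]

    for g in range(num_groups):
        group_channels = range(g * per_group, (g + 1) * per_group)
        for d in range(max_disp):
            for y in range(height):
                for x in range(d, width):
                    dot = sum(left[c][y][x] * right[c][y][x - d]
                              for c in group_channels)
                    volume[g][d][y][x] = dot / per_group
    return volume


# Toy example: two channels, one row, four columns. The left features are
# the right features shifted by one pixel, i.e. the true disparity is 1.
right_feat = [[[1.0, 0.0, 1.0, 0.0]], [[0.0, 1.0, 0.0, 1.0]]]
left_feat = [[[0.0, 1.0, 0.0, 1.0]], [[1.0, 0.0, 1.0, 0.0]]]
volume = groupwise_correlation_volume(left_feat, right_feat,
                                      max_disp=2, num_groups=1)
```

In this toy example, at column x = 1 the response for d = 1 is 0.5 while the response for d = 0 is 0.0, so the correct disparity scores highest. A practical implementation would vectorize these loops on the GPU; the nested loops are kept only to make the indexing explicit.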