Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries
Yang Ding, Can Han, Sijia Du, Yaqi Wang, Dahong Qian
TL;DR
This work addresses real-time depth estimation in endoscopic stereo matching under fuzzy tissue boundaries by introducing RRESM, a lightweight framework that couples a MobileNetV4-based feature extractor with a 3D Mamba Coordinate Attention (MCA) module for efficient cost aggregation and a High-Frequency Disparity Optimization (HFDO) module for boundary refinement. The 3D MCA captures axis-specific long-range dependencies along disparity, height, and width, while HFDO leverages Haar wavelets to emphasize high-frequency boundary structures, yielding sharp disparity maps without heavy computation. Evaluations on SCARED and SERV-CT demonstrate state-of-the-art accuracy and real-time speed (≈42 FPS) on high-resolution inputs, with ablations confirming the complementary benefits of MCA and HFDO. These results suggest practical impact for robotic MIS by providing accurate depth in challenging boundary regions at clinically viable frame rates, while also outlining future work on uncertainty handling and multi-view extension.
Abstract
Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose RRESM, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.
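
To make "amplifying high-frequency details in the wavelet domain" concrete, the sketch below applies a single-level 2D Haar transform to a disparity map, scales the three detail sub-bands, and inverts the transform. It is only an illustration of the general idea, not the HFDO module itself: the function names (`haar_dwt2`, `haar_idwt2`, `amplify_high_freq`) and the fixed scalar `gain` are assumptions made here, whereas the module in the paper is presumably learned and operates inside the network.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform (height and width must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # differences across columns
    hl = (a + b - c - d) / 2.0  # differences across rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; reconstructs the full-resolution array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def amplify_high_freq(disp, gain=1.5):
    """Scale the Haar detail bands by a fixed gain (hypothetical stand-in for a learned refinement)."""
    ll, lh, hl, hh = haar_dwt2(np.asarray(disp, dtype=np.float64))
    return haar_idwt2(ll, gain * lh, gain * hl, gain * hh)


# Toy disparity map with a vertical boundary between columns 0 and 1.
disp = np.array([[10.0, 20.0, 20.0, 20.0]] * 4)
sharpened = amplify_high_freq(disp, gain=2.0)
```

With `gain=1.0` the input is reconstructed exactly. A larger gain widens the disparity step only inside 2x2 blocks that straddle a boundary, and flat regions pass through unchanged because their detail coefficients are zero.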
