StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies
Xu Wang, Jialang Xu, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos B. Mazomenos
TL;DR
StereoMamba targets real-time, robust stereo disparity estimation for robot-assisted minimally invasive surgery (RAMIS). It introduces FE-Mamba, a feature extractor that combines self- and cross-attention, and MFF, a fusion module that integrates multi-scale cross-image information. The method builds a group-wise cost volume from the cross-attention–driven features, decodes it with a cost-volume decoder, regresses disparity with soft-argmin, and is trained with a multi-output smooth L1 loss. On the ex-vivo SCARED benchmark, StereoMamba achieves the best EPE (2.64 px) and depth MAE (2.55 mm) among the compared methods while running at 21.28 FPS on 1280×1024 stereo pairs. It also shows strong zero-shot generalization to the in-vivo RIS2017 and StereoMIS datasets, with the best average SSIM (0.8970) and PSNR (16.0761). Ablations indicate that FE-Mamba and MFF both improve accuracy and speed, underscoring the importance of long-range spatial dependencies and cross-image fusion for disparity estimation in RAMIS.
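To make the soft-argmin and smooth L1 steps mentioned above concrete, here is a minimal per-pixel sketch in plain Python. It is not the authors' implementation. It assumes the decoder yields one matching cost per candidate disparity, with lower cost meaning a better match; some networks instead apply the softmax directly to decoder scores. The function names, the `beta` threshold, and the per-output `weights` are hypothetical, since the summary does not specify them.

```python
import math

def soft_argmin(costs):
    """Expected disparity under softmax(-cost) over candidates 0..D-1.

    Assumption: lower cost = better match (sign convention not given in the text).
    """
    m = min(costs)  # shift so the largest exponent is 0 (numerical stability)
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) penalty: quadratic below beta, linear above."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

def multi_output_loss(preds, target, weights):
    """Weighted sum of smooth L1 terms over several disparity outputs.

    The weights are illustrative placeholders, not values from the paper.
    """
    return sum(w * smooth_l1(p, target) for p, w in zip(preds, weights))
```

Because soft-argmin is a weighted average rather than a hard index, it yields sub-pixel disparities and stays differentiable, which is what allows a regression loss such as smooth L1 to be applied to its output.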
Abstract
Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in balancing accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves the best EPE (2.64 px) and depth MAE (2.55 mm) and the second-best Bad2 (41.49%) and Bad3 (26.99%), while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280×1024), striking the best balance between accuracy, robustness, and efficiency among the compared methods. Furthermore, when right images synthesized by warping the left images with the predicted disparity maps are compared with the actual right images, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
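For readers unfamiliar with the reported metrics, the sketch below shows how they are conventionally computed. It uses the standard definitions, not the authors' evaluation code. It assumes flat lists of valid pixels only, disparity errors in pixels, a rectified stereo rig where depth = focal length × baseline / disparity, and 8-bit image intensities for PSNR. The function names and the example focal length and baseline are hypothetical.

```python
import math

def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def bad_n(pred, gt, n):
    """Percentage of pixels whose disparity error exceeds n pixels (Bad2: n=2, Bad3: n=3)."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > n)
    return 100.0 * bad / len(gt)

def disparity_to_depth(disparity_px, focal_px, baseline_mm):
    """Depth in mm for a rectified pair: Z = f * B / d (requires d > 0)."""
    return focal_px * baseline_mm / disparity_px

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy example with made-up values (not data from the paper).
pred = [10.0, 12.5, 20.0, 31.0]
gt = [10.0, 10.0, 20.0, 27.0]
print(epe(pred, gt), bad_n(pred, gt, 2), bad_n(pred, gt, 3))
print(disparity_to_depth(20.0, focal_px=1000.0, baseline_mm=4.0))
```

EPE is an average and BadN is a threshold count, so they can rank methods differently: a few large outliers raise EPE without much effect on BadN, whereas many errors just above the threshold raise BadN while EPE stays low. This is consistent with a method ranking first on EPE and second on Bad2 and Bad3.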
