MSV-Mamba: A Multiscale Vision Mamba Network for Echocardiography Segmentation
Xiaoxian Yang, Qi Wang, Kaiqi Zhang, Ke Wei, Jun Lyu, Lingchao Chen
TL;DR
This work tackles accurate echocardiography segmentation under noisy, low-resolution conditions by introducing MSV-Mamba, a U-shaped network that combines a cascaded residual encoder with a large-window Mamba-based decoder. Key innovations include large-window Mamba scale (LMS) decoder blocks that capture global context with linear-like complexity, a Multiscale Attention Aggregation module that fuses multilayer features through dual spatial-channel attention, and hierarchical auxiliary losses that supervise each decoder layer. Results on EchoNet-Dynamic and CAMUS show superior performance in left ventricular endocardium and epicardium segmentation, with notable robustness to noise and morphological variation. The approach offers a practical path toward real-time, reliable automatic echocardiography analysis, with potential extension to 3D reconstruction in future work.
Abstract
Ultrasound imaging is frequently hampered by high noise levels, low spatiotemporal resolution, and the complexity of anatomical structures. These factors significantly hinder a model's ability to accurately capture and analyze structural relationships and dynamic patterns across regions of the heart. Mamba is an emerging architecture that has been widely applied to diverse vision and language tasks. Building on it, this paper introduces a U-shaped deep learning model that incorporates a large-window Mamba scale (LMS) module and a hierarchical feature fusion approach for echocardiographic segmentation. First, cascaded residual blocks serve as the encoder and incrementally extract multiscale detailed features. Second, the LMS module is integrated into the decoder to capture global dependencies across regions and improve the segmentation of complex anatomical structures. Third, the model applies auxiliary losses at each decoder layer and uses a dual attention mechanism to fuse multilayer features both spatially and across channels, which improves accuracy in delineating complex anatomical structures. Finally, experimental results on the EchoNet-Dynamic and CAMUS datasets demonstrate that the model outperforms other methods in both accuracy and robustness. For segmentation of the left ventricular endocardium (${LV}_{endo}$), the model achieved its best values of 95.01 and 93.36, respectively; for the left ventricular epicardium (${LV}_{epi}$), it achieved 87.35 and 87.80, respectively. This represents an improvement of between 0.54 and 1.11 over the best-performing competing model.
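To illustrate auxiliary losses at each decoder layer, here is a minimal sketch of deep supervision in plain Python. The abstract does not specify the loss function or the layer weights, so the soft Dice loss, the `aux_weight` value, and the function names below are assumptions made for illustration only. Predictions are flat lists of probabilities, assumed to be already resized to the resolution of the ground-truth mask.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Return 1 - soft Dice overlap between probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def deep_supervision_loss(decoder_preds, target, aux_weight=0.4):
    """Final-layer loss plus down-weighted losses from earlier decoder layers.

    decoder_preds is ordered from the shallowest decoder output to the final
    one. aux_weight is a hypothetical value, not taken from the paper.
    """
    *aux_preds, final_pred = decoder_preds
    loss = soft_dice_loss(final_pred, target)
    for pred in aux_preds:
        loss += aux_weight * soft_dice_loss(pred, target)
    return loss


# Toy 4-pixel mask with three decoder outputs of increasing quality.
target = [0, 1, 1, 0]
preds = [
    [0.5, 0.5, 0.5, 0.5],  # coarse early decoder output
    [0.2, 0.8, 0.7, 0.1],  # intermediate decoder output
    [0.0, 1.0, 1.0, 0.0],  # final output, matches the mask exactly
]
total = deep_supervision_loss(preds, target)
print(f"total loss: {total:.4f}")
```

In this toy case the final output contributes no loss, so the total comes entirely from the two auxiliary terms. This is how supervising intermediate layers keeps a training signal flowing to the earlier decoder stages.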
