MambaCAFU: Hybrid Multi-Scale and Multi-Attention Model with Mamba-Based Fusion for Medical Image Segmentation
T-Mai Bui, Fares Bougourzi, Fadi Dornaika, Vinh Truong Hoang
TL;DR
MambaCAFU targets cross-task generalization and efficiency in medical image segmentation (MIS) by integrating a three-branch encoder (CNN, Transformer, Mamba-based fusion) with a multi-scale, co-attentive decoder. Its core building blocks, namely the Mamba-based Attention Fusion (MAF), CoASMamba, CoAMamba, and DoubleLCoA blocks, enable efficient fusion of local, global, and long-range features across scales. The approach demonstrates state-of-the-art or competitive performance on six diverse benchmarks (Synapse, BTCV, ACDC, ISIC 2017, GlaS, MoNuSeg) with balanced computational demands, supporting practical deployment in clinical settings. By coupling accuracy with scalability, MambaCAFU offers a versatile solution for heterogeneous MIS tasks and modalities. Code and models are to be released upon acceptance.
Abstract
In recent years, deep learning has shown near-expert performance in segmenting complex medical tissues and tumors. However, existing models are often task-specific, with performance varying across modalities and anatomical regions. Balancing model complexity and performance remains challenging, particularly in clinical settings where both accuracy and efficiency are critical. To address these issues, we propose a hybrid segmentation architecture featuring a three-branch encoder that integrates CNNs, Transformers, and a Mamba-based Attention Fusion (MAF) mechanism to capture local, global, and long-range dependencies. A multi-scale attention-based CNN decoder reconstructs fine-grained segmentation maps while preserving contextual consistency. Additionally, a co-attention gate enhances feature selection by emphasizing relevant spatial and semantic information across scales during both encoding and decoding, improving feature interaction and cross-scale communication. Extensive experiments on multiple benchmark datasets show that our approach outperforms state-of-the-art methods in accuracy and generalization, while maintaining comparable computational complexity. By effectively balancing efficiency and effectiveness, our architecture offers a practical and scalable solution for diverse medical imaging tasks. Source code and trained models will be publicly released upon acceptance to support reproducibility and further research.
