UNet with Self-Adaptive Mamba-Like Attention and Causal-Resonance Learning for Medical Image Segmentation

Saqib Qamar, Mohd Fazil, Parvez Ahmad, Shakir Khan, Abu Taha Zamani

TL;DR

This work tackles the trade-off between efficiency and accuracy in medical image segmentation by introducing SAMA-UNet, which combines Self-Adaptive Mamba-like Aggregated Attention (SAMA) with a Causal-Resonance Multi-Scale Module (CR-MSM). SAMA provides a dynamic, channel-split attention mechanism that fuses local and global features with linear complexity, while CR-MSM preserves causal, multi-scale information flow during encoder–decoder fusion. Across MRI, CT, and endoscopy datasets, SAMA-UNet achieves state-of-the-art or competitive DSC and NSD scores with lower computational overhead than many transformer- and Mamba-based baselines. The work demonstrates both technical and clinical relevance by delivering high segmentation accuracy with scalable efficiency, and it outlines concrete future directions for memory-efficient 3D extensions and real-time deployment.

Abstract

Medical image segmentation plays an important role in various clinical applications; however, existing deep learning models face trade-offs between efficiency and accuracy. Convolutional Neural Networks (CNNs) capture local details well but miss the global context, whereas Transformers handle the global context but at a high computational cost. Recently, State Space Sequence Models (SSMs) have shown potential for capturing long-range dependencies with linear complexity; however, their direct use in medical image segmentation remains limited due to incompatibility with image structures and autoregressive assumptions. To overcome these challenges, we propose SAMA-UNet, a novel U-shaped architecture that introduces two key innovations. First, the Self-Adaptive Mamba-like Aggregated Attention (SAMA) block adaptively integrates local and global features through dynamic attention weighting, enabling an efficient representation of complex anatomical patterns. Second, the Causal-Resonance Multi-Scale Module (CR-MSM) improves encoder-decoder interactions by adjusting feature resolution and causal dependencies across scales, enhancing the semantic alignment between low- and high-level features. Extensive experiments on MRI, CT, and endoscopy datasets demonstrate that SAMA-UNet consistently outperforms CNN-, Transformer-, and Mamba-based methods. It achieves 85.38% DSC and 87.82% NSD on BTCV, 92.16% and 96.54% on ACDC, 67.14% and 68.70% on EndoVis17, and 84.06% and 88.47% on ATLAS23, establishing new benchmarks across modalities. These results confirm the effectiveness of SAMA-UNet in combining efficiency and accuracy, making it a promising solution for real-world clinical segmentation tasks. The source code is available on GitHub.
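The DSC figures reported in the abstract are Dice similarity coefficients, defined for a predicted mask A and a reference mask B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch of that definition in plain Python, not the authors' evaluation code. The function names `dice_score` and `mean_dice`, the flat-list input format, and the convention of returning 1.0 when a label is absent from both masks are assumptions made for illustration. NSD is not covered here because it additionally depends on surface extraction and a distance tolerance.

```python
def dice_score(pred, target, label):
    """Dice similarity coefficient for one label.

    pred and target are flat sequences of integer class labels, one per
    pixel or voxel (hypothetical input format chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pred_pos = sum(1 for p in pred if p == label)
    target_pos = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    denom = pred_pos + target_pos
    # Assumed convention: a label missing from both masks counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * overlap / denom


def mean_dice(pred, target, labels):
    """Unweighted mean of per-label Dice scores over the given foreground labels."""
    return sum(dice_score(pred, target, c) for c in labels) / len(labels)


# Toy example: 8 pixels, background 0 and two foreground classes.
pred = [0, 1, 1, 2, 2, 2, 0, 0]
target = [0, 1, 1, 1, 2, 2, 2, 0]
per_class = {c: dice_score(pred, target, c) for c in (1, 2)}
average = mean_dice(pred, target, labels=(1, 2))
```

In this toy case class 1 overlaps on 2 pixels out of 2 predicted and 3 reference pixels, and class 2 on 2 out of 3 and 3, so the per-class scores differ even though both classes miss by a single pixel.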

Paper Structure

This paper contains 24 sections, 6 equations, 5 figures, 4 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of the SAMA-UNet architecture. SAMA is used in the encoder, residual convolution blocks are used in the decoder, and a CR-MSM with implicit causality is applied in the skip connections. The SAMA block includes a token mixer sub-block with a modified pixel-focused attention mechanism and a Feed-Forward Neural Network (FFN) sub-block.
  • Figure 2: Illustration of the structure designs of Mamba, Mamba-Like Linear Attention, and our SAMA block.
  • Figure 3: Visualization of segmentation examples on the Synapse multi-organ dataset (BTCV, 1st row), the Automated Cardiac Diagnosis Challenge dataset (ACDC, 2nd row), the endoscopy image dataset (EndoVis17, 3rd row), and the liver tumor dataset (ATLAS23, 4th row). Results of the proposed SAMA-UNet are shown in the last column. SAMA-UNet is more robust to heterogeneous appearances and has fewer segmentation outliers.
  • Figure 4: Illustration of performance variations observed through ablation studies, highlighting the impact of different architectural components on segmentation accuracy. Evaluations are based on DSC and NSD on the BTCV dataset.
  • Figure 5: Impact of different types of token mixers on model GFLOPs and parameter count. DiffAtt and ML refer to differential attention and Mamba-like macro structure, respectively.
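
The captions for Figures 2 and 5 refer to Mamba-like linear attention as a token mixer, and the abstract attributes linear complexity to this family of mechanisms. As a rough illustration of where that linearity comes from, the sketch below implements generic kernelized linear attention in plain Python. It is not the SAMA block: the `elu(x) + 1` feature map, the function names, and the list-of-lists tensors are all assumptions chosen for readability. The key point is that the key-value summary is accumulated once in a single pass over the N tokens and then reused for every query, instead of forming an N × N attention matrix.

```python
import math


def feature_map(x):
    """elu(x) + 1 applied elementwise; keeps every feature strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention in O(N * d_k * d_v) time for N tokens."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    k_sum = [0.0] * d_k                     # running sum of phi(k)
    for k, v in zip(keys, values):          # one pass over the tokens
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:                       # each query reuses the summaries
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        out.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out


def quadratic_reference(queries, keys, values):
    """Same result computed through explicit pairwise weights, O(N^2)."""
    d_v = len(values[0])
    out = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                    for b in range(d_v)])
    return out


# Toy input: 3 tokens, key/query dimension 2, value dimension 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 2.0], [0.3, -0.7]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

Both functions produce the same outputs up to floating-point rounding; only the order of summation differs, which is what removes the quadratic dependence on the number of tokens.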