Prompt-Guided Dual-Path UNet with Mamba for Medical Image Segmentation

Shaolei Zhang, Jinyan Liu, Tianyi Qian, Xuesong Li

TL;DR

The paper tackles the challenge of balancing local detail and global context in medical image segmentation by proposing PGM-UNet, a prompt-guided CNN-Mamba dual-path UNet. A local information extraction module (LIEM) and a prompt-guided residual Mamba module (PGRM) run in parallel, a multi-focus attention fusion module (MAFM) merges their outputs, and a multi-scale information extraction module (MIEM) serves as the bottleneck. The PGRM derives visual prompts from the original input to guide Mamba's global modeling, while the Kolmogorov-Arnold Network (KAN)-based MIEM captures richer context without reducing resolution. Across ISIC-2017, ISIC-2018, DIAS, DRIVE, and cross-dataset PH2 tests, PGM-UNet achieves state-of-the-art or competitive results and generalizes well, while remaining parameter-efficient (~5.48M parameters). These findings support combining prompt-guided global modeling with parallel local-global fusion for robust medical image segmentation.

Abstract

Convolutional neural networks (CNNs) and transformers are widely employed in constructing UNet architectures for medical image segmentation tasks. However, CNNs struggle to model long-range dependencies, while transformers suffer from quadratic computational complexity. Recently, Mamba, a type of state space model (SSM), has gained attention for its exceptional ability to model long-range interactions while maintaining linear computational complexity. Despite the emergence of several Mamba-based methods, they still present the following limitations: first, their network designs generally lack perceptual capabilities for the original input data; second, they primarily focus on capturing global information, while often neglecting local details. To address these challenges, we propose a prompt-guided CNN-Mamba dual-path UNet, termed PGM-UNet, for medical image segmentation. Specifically, we introduce a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the original input data, effectively guiding Mamba in capturing global information. Additionally, we design a local-global information fusion network, comprising a local information extraction module, a prompt-guided residual Mamba module, and a multi-focus attention fusion module, which effectively integrates local and global information. Furthermore, inspired by Kolmogorov-Arnold Networks (KANs), we develop a multi-scale information extraction module to capture richer contextual information without altering the resolution. We conduct extensive experiments on the ISIC-2017, ISIC-2018, DIAS, and DRIVE datasets. The results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in multiple medical image segmentation tasks.
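
The linear-complexity claim for state space models can be illustrated with a toy recurrence. The sketch below is a scalar, time-invariant state-space scan with hypothetical parameters a, b, and c. It is not Mamba's selective scan (where the parameters depend on the input) and not the paper's prompt-guided residual Mamba module. It only shows that the sequence is processed in a single pass, so cost grows linearly with length, whereas self-attention compares every pair of positions.

```python
def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    a, b, c are illustrative constants, not values from the paper.
    """
    h = 0.0
    ys = []
    for x_t in x:          # one step per element: O(len(x)) work overall
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


# An impulse at position 0 decays geometrically through the hidden state,
# so later outputs still "remember" the first input.
impulse = [1.0] + [0.0] * 4
response = ssm_scan(impulse, a=0.5)
print(response)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The hidden state carries information from early positions to late ones, which is how such models capture long-range interactions without pairwise comparisons.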

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of results on ISIC-2018. The X-axis shows the number of parameters (lower is better), and the Y-axis shows the DSC (higher is better). The proposed method achieves a good trade-off between segmentation accuracy and parameter efficiency.
  • Figure 2: Overview of the proposed PGM-UNet architecture. The encoder and decoder are primarily composed of the LG-Net. The LIEM extracts local information, while the PGRM extracts global information under the guidance of prompt information. The MAFM employs channel attention mechanisms to reweight the local information from the LIEM and the global information from the PGRM, enabling effective integration. Additionally, the MIEM, which serves as the bottleneck layer, combines multiple dilated convolutions with KANs.
  • Figure 3: Visual comparison of different medical image segmentation approaches on skin lesion images selected from the ISIC-2017 and ISIC-2018 datasets. Red lines indicate the boundaries of the labels.
  • Figure 4: Visual comparison of different medical image segmentation approaches on vessel segmentation images selected from the DIAS and DRIVE datasets. Zoom in on the regions marked in red for a closer view.
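
The Figure 2 caption says the MAFM uses channel attention to reweight local and global features before integrating them. The sketch below shows one way such reweighting could look, using a squeeze-and-excitation style gate in NumPy. The gate design, the reduction ratio r, the random weights, and the final summation are all assumptions made for illustration; the page does not specify the MAFM's internals.

```python
import numpy as np


def channel_attention(feat, w1, w2):
    """Return per-channel weights in (0, 1) for a (C, H, W) feature map."""
    squeezed = feat.mean(axis=(1, 2))         # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # ReLU bottleneck -> (C // r,)
    logits = w2 @ hidden                      # back to (C,)
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid gate


def fuse(local_feat, global_feat, params):
    """Reweight each branch per channel, then sum (assumed fusion rule)."""
    w_local = channel_attention(local_feat, *params["local"])
    w_global = channel_attention(global_feat, *params["global"])
    # Broadcast (C,) weights over the spatial dimensions.
    return (w_local[:, None, None] * local_feat
            + w_global[:, None, None] * global_feat)


rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4  # hypothetical sizes and reduction ratio
params = {
    name: (rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
    for name in ("local", "global")
}
local_feat = rng.standard_normal((C, H, W))   # stand-in for LIEM output
global_feat = rng.standard_normal((C, H, W))  # stand-in for PGRM output

fused = fuse(local_feat, global_feat, params)
print(fused.shape)  # (8, 16, 16)
```

Because the gates act per channel, the fused map keeps the spatial resolution of its inputs; only the relative contribution of each branch changes from channel to channel.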