Prompt-Guided Dual-Path UNet with Mamba for Medical Image Segmentation
Shaolei Zhang, Jinyan Liu, Tianyi Qian, Xuesong Li
TL;DR
The paper addresses the trade-off between local detail and global context in medical image segmentation by proposing PGM-UNet, a prompt-guided CNN-Mamba dual-path UNet. A local information extraction module (LIEM) and a prompt-guided residual Mamba module (PGRM) run in parallel, and a fusion module merges their local and global features; a multi-scale information extraction module (MIEM) serves as the bottleneck. The PGRM derives visual prompts from the input to guide Mamba's global modeling, and the MIEM, inspired by Kolmogorov-Arnold Networks, captures richer context without reducing resolution. Across ISIC-2017/2018, DIAS, DRIVE, and cross-dataset PH2 tests, PGM-UNet achieves state-of-the-art or competitive results and generalizes well while remaining parameter-efficient (~5.48M parameters). These results suggest that combining prompt-guided global modeling with parallel local-global fusion is a practical route to robust medical image segmentation.
Abstract
Convolutional neural networks (CNNs) and transformers are widely used to build UNet architectures for medical image segmentation. However, CNNs struggle to model long-range dependencies, while transformers suffer from quadratic computational complexity. Recently, Mamba, a type of state space model (SSM), has gained attention for its ability to model long-range interactions while maintaining linear computational complexity. Although several Mamba-based methods have emerged, they share two limitations: first, their network designs generally lack the ability to perceive the original input data; second, they focus primarily on capturing global information and often neglect local details. To address these limitations, we propose a prompt-guided CNN-Mamba dual-path UNet, termed PGM-UNet, for medical image segmentation. Specifically, we introduce a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the original input data and uses them to guide Mamba in capturing global information. Additionally, we design a local-global information fusion network, comprising a local information extraction module, a prompt-guided residual Mamba module, and a multi-focus attention fusion module, which integrates local and global information. Furthermore, inspired by Kolmogorov-Arnold Networks (KANs), we develop a multi-scale information extraction module that captures richer contextual information without altering the resolution. We conduct extensive experiments on the ISIC-2017, ISIC-2018, DIAS, and DRIVE datasets. The results demonstrate that the proposed method significantly outperforms state-of-the-art approaches on multiple medical image segmentation tasks.
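The complexity contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the PGM-UNet or Mamba implementation: it assumes a scalar state space recurrence with fixed, hypothetical parameters a, b, and c, whereas Mamba makes its parameters input-dependent and uses a vector-valued state. The function names ssm_scan and pairwise_interactions are illustrative only. The point is that the recurrence visits each token once, so cost grows linearly with sequence length L, while full self-attention scores L x L token pairs.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar discrete state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The parameters a, b, c are fixed here (an assumption for illustration).
    """
    h = 0.0  # hidden state carries information from all earlier tokens
    ys = []
    for x_t in x:  # one constant-cost update per token: O(L) in total
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def pairwise_interactions(length: int) -> int:
    """Number of token pairs scored by full self-attention: L * L."""
    return length * length


if __name__ == "__main__":
    # An impulse at position 0 still influences later outputs via the state.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
    # prints [1.0, 0.5, 0.25, 0.125]
    print(pairwise_interactions(4))  # prints 16
```

In this toy setting the first input decays geometrically but never disappears from the state, which is the mechanism by which a recurrence of this kind can carry long-range information without comparing every pair of positions.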
