HyM-UNet: Synergizing Local Texture and Global Context via Hybrid CNN-Mamba Architecture for Medical Image Segmentation

Haodong Chen, Xianfei Han, Qwen

TL;DR

HyM-UNet addresses a core challenge in medical image segmentation: balancing local texture detail against global context. Its encoder uses CNN modules in the shallow stages and Visual Mamba modules in the deep stages. The Mamba-Guided Fusion Skip Connection (MGF-Skip) uses deep semantic features to gate the shallow encoder features passed to the decoder, which suppresses background noise and sharpens boundaries, while a hybrid loss emphasizes boundary accuracy. On the ISIC 2018 benchmark, HyM-UNet achieves higher Dice and IoU scores than strong baselines with fewer parameters and lower inference latency, and it performs robustly on lesions of varied shapes and scales. The approach offers a practical, efficient path toward accurate clinical segmentation by combining low-cost local features with scalable global modeling.
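For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou` and the smoothing term `eps` are illustrative assumptions, not taken from the paper, which may use a different evaluation protocol.

```python
def dice_and_iou(pred, target, eps=1e-7):
    """Return (dice, iou) for two binary masks given as nested lists of 0/1.

    `eps` is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    flat_pred = [int(v) for row in pred for v in row]
    flat_true = [int(v) for row in target for v in row]

    # Pixels labelled foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_true))
    pred_area = sum(flat_pred)
    true_area = sum(flat_true)
    union = pred_area + true_area - intersection

    dice = (2 * intersection + eps) / (pred_area + true_area + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(dice_and_iou(prediction, ground_truth))
```

Dice weights the overlap twice relative to the two mask areas, so for any imperfect prediction it is at least as large as IoU; the two metrics rank methods similarly but are not interchangeable numerically.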

Abstract

Accurate organ and lesion segmentation is a critical prerequisite for computer-aided diagnosis. Convolutional Neural Networks (CNNs), constrained by their local receptive fields, often struggle to capture complex global anatomical structures. To address this limitation, this paper proposes a novel hybrid architecture, HyM-UNet, designed to combine the local feature extraction capabilities of CNNs with the efficient global modeling capabilities of Mamba. Specifically, we design a Hierarchical Encoder that uses convolutional modules in the shallow stages to preserve high-frequency texture details and introduces Visual Mamba modules in the deep stages to capture long-range semantic dependencies with linear complexity. To bridge the semantic gap between the encoder and the decoder, we propose a Mamba-Guided Fusion Skip Connection (MGF-Skip). This module uses deep semantic features as gating signals to dynamically suppress background noise within shallow features, thereby enhancing the perception of ambiguous boundaries. We conduct extensive experiments on the public ISIC 2018 benchmark dataset. The results demonstrate that HyM-UNet significantly outperforms existing state-of-the-art methods in terms of Dice coefficient and IoU, while maintaining lower parameter counts and inference latency. This validates the effectiveness and robustness of the proposed method on medical segmentation tasks characterized by complex shapes and scale variations.
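The gating idea behind MGF-Skip can be illustrated with a generic gated skip connection. The sketch below is not the paper's implementation, whose exact layers are not given in this summary. It assumes nearest-neighbour upsampling, a single-channel gate produced by a 1x1 projection (`w_gate`), and a sigmoid activation; the names `gated_skip` and `w_gate` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_skip(shallow, deep, w_gate):
    """Generic gated skip connection (illustrative assumption only).

    shallow: (C_s, H, W) high-resolution encoder features.
    deep:    (C_d, H/s, W/s) low-resolution semantic features.
    w_gate:  (C_d,) hypothetical 1x1 projection weights mapping the deep
             channels to one gate logit per pixel.
    """
    _, h, w = shallow.shape
    scale_h = h // deep.shape[1]
    scale_w = w // deep.shape[2]

    # Nearest-neighbour upsampling of the deep features to (C_d, H, W).
    deep_up = deep.repeat(scale_h, axis=1).repeat(scale_w, axis=2)

    # 1x1 projection: weighted sum over deep channels -> (H, W) logits.
    logits = np.tensordot(w_gate, deep_up, axes=([0], [0]))

    # Gate in (0, 1): low values suppress shallow responses at that pixel.
    gate = sigmoid(logits)
    return shallow * gate[None, :, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow = rng.normal(size=(8, 16, 16))
    deep = rng.normal(size=(32, 4, 4))
    w_gate = rng.normal(size=32)
    print(gated_skip(shallow, deep, w_gate).shape)
```

In this form the deep features decide, per pixel, how much of the shallow texture signal is passed to the decoder, which is one common way to realize "deep features as gating signals".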

Paper Structure

This paper contains 18 sections, 11 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The overall architecture of the proposed HyM-UNet.
  • Figure 2: The visualization results.