BEFUnet: A Hybrid CNN-Transformer Architecture for Precise Medical Image Segmentation

Omid Nejati Manzari, Javad Mirzapour Kaleybar, Hooman Saadat, Shahin Maleki

TL;DR

BEFUnet targets precise medical image segmentation by fusing edge-focused CNN features with body-context Transformer features through a dual-branch encoder. It introduces a Local Cross-Attention Feature (LCAF) fusion module that merges the edge and body feature streams efficiently, and a Double-Level Fusion (DLF) module that integrates coarse and fine representations across scales. Evaluated on the Synapse, SegPC, and ISIC datasets, BEFUnet reports state-of-the-art performance across multiple metrics and imaging modalities, with particularly strong boundary delineation and robustness to variation in shape and texture. The approach is relevant to clinical imaging because it delivers accurate segmentations with efficient computation and strong generalization.

Abstract

The accurate segmentation of medical images is critical for various healthcare applications. Convolutional neural networks (CNNs), especially Fully Convolutional Networks (FCNs) like U-Net, have shown remarkable success in medical image segmentation tasks. However, they have limitations in capturing global context and long-range relations, especially for objects with significant variations in shape, scale, and texture. While transformers have achieved state-of-the-art results in natural language processing and image recognition, they face challenges in medical image segmentation due to weak modeling of image locality and translational invariance. To address these challenges, this paper proposes an innovative U-shaped network called BEFUnet, which enhances the fusion of body and edge information for precise medical image segmentation. BEFUnet comprises three main modules: a novel Local Cross-Attention Feature (LCAF) fusion module, a novel Double-Level Fusion (DLF) module, and a dual-branch encoder. The dual-branch encoder consists of an edge encoder and a body encoder. The edge encoder employs PDC blocks for effective edge information extraction, while the body encoder uses the Swin Transformer to capture semantic information with global attention. The LCAF module efficiently fuses edge and body features by selectively performing local cross-attention on features that are spatially close between the two modalities. This local approach significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching. BEFUnet demonstrates superior performance over existing methods across various evaluation metrics on medical image segmentation datasets.
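
To make the complexity argument for local cross-attention concrete, the following is a minimal sketch in plain Python, not the authors' LCAF implementation. It assumes, for illustration only, that "spatially close" means a square window of a given `radius` around each position, and it omits the learned query/key/value projections and multiple heads a real module would have. The function name `local_cross_attention` and its parameters are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_cross_attention(edge, body, radius=1):
    """Each edge position queries only the body positions inside a
    (2*radius+1) x (2*radius+1) window centred on it (assumed window shape).

    edge, body: H x W grids whose cells are feature vectors of equal length.
    Returns the fused grid and the number of query-key pairs evaluated.
    """
    h, w = len(edge), len(edge[0])
    d = len(edge[0][0])
    fused = [[None] * w for _ in range(h)]
    pairs = 0
    for i in range(h):
        for j in range(w):
            query = edge[i][j]
            # Body features in the local window, clipped at the borders.
            window = [body[a][b]
                      for a in range(max(0, i - radius), min(h, i + radius + 1))
                      for b in range(max(0, j - radius), min(w, j + radius + 1))]
            scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                      for key in window]
            weights = softmax(scores)
            # Weighted sum of the window's body features (keys double as values).
            fused[i][j] = [sum(wt * v[c] for wt, v in zip(weights, window))
                           for c in range(d)]
            pairs += len(window)
    return fused, pairs


# Toy 4 x 4 feature maps with 2-dimensional features.
edge = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
body = [[[1.0, -1.0] for _ in range(4)] for _ in range(4)]
fused, local_pairs = local_cross_attention(edge, body, radius=1)
global_pairs = (4 * 4) ** 2  # every position attending to every position
print(local_pairs, global_pairs)
```

On this toy 4 x 4 grid the windowed version evaluates 100 query-key pairs, against 256 for global cross-attention. The gap widens with resolution, since the local cost grows linearly in the number of positions while the global cost grows quadratically.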

Paper Structure (24 sections, 15 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: The structure of our proposed BEFUnet. BEFUnet consists of a dual-encoder that simultaneously extracts edge and body features. Moreover, it incorporates an efficient fusion module called LCAF, which facilitates the merging of edge and body features. Additionally, there is a DLF module integrated into the skip connection of the encoder-decoder structure to fully integrate features from adjacent scales.
  • Figure 2: A block diagram of the LCAF.
  • Figure 3: The cross-attention process involves several steps. First, the class token of the small level, denoted $CLS^s$, is projected for dimension alignment and then appended to $P^l$. The resulting embedding serves as both the key and the value, while the query is formed from $CLS'^s$. Attention computation and back projection then yield $Z^s$. Notably, the same process can be applied to the large level.
  • Figure 4: Segmentation results of the proposed method on the Synapse dataset. The red rectangles identify organ regions where the superiority of our proposed method can be clearly seen.
  • Figure 5: Visual representation of the proposed method on the SegPC cell segmentation dataset.
  • ...and 1 more figures
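
The cross-attention steps described in the Figure 3 caption can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the paper's code: the projection and back-projection are plain matrices passed in by the caller, the learned query/key/value weights and any residual connection are omitted, and only the updated small-level class token is returned. The names `cls_cross_attention`, `proj`, and `back_proj` are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cls_cross_attention(cls_s, patches_l, proj, back_proj):
    """Update the small-level class token using large-level patch tokens.

    cls_s:      class token of the small level (CLS^s)
    patches_l:  patch tokens of the large level (P^l)
    proj:       matrix aligning CLS^s to the large-level dimension
    back_proj:  matrix mapping the result back to the small-level dimension
    """
    # Step 1: project CLS^s for dimension alignment, giving CLS'^s.
    cls_aligned = matvec(proj, cls_s)
    # Step 2: append to P^l; the result acts as both key and value.
    tokens = [cls_aligned] + patches_l
    # Step 3: the query is CLS'^s; scaled dot-product attention over tokens.
    d = len(cls_aligned)
    scores = [sum(q * k for q, k in zip(cls_aligned, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    attended = [sum(w * tok[c] for w, tok in zip(weights, tokens))
                for c in range(d)]
    # Step 4: back projection to the small-level dimension.
    return matvec(back_proj, attended), weights


# Toy sizes: small-level dimension 2, large-level dimension 3.
proj = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
back_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cls_s = [1.0, 2.0]
patches_l = [[1.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
z_cls, attn = cls_cross_attention(cls_s, patches_l, proj, back_proj)
print(z_cls, attn)
```

Because only a single class token acts as the query, the attention cost is linear in the number of large-level tokens. Swapping the roles of the two levels gives the large-level variant mentioned in the caption.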