HST-MRF: Heterogeneous Swin Transformer with Multi-Receptive Field for Medical Image Segmentation

Xiaofei Huang, Hongfang Gong, Jin Zhang

TL;DR

The paper tackles the loss of structural information caused by patch partitioning in Transformer-based medical image segmentation. It introduces HST-MRF, a U-Net-style architecture whose encoder uses adaptive patch embedding (APE) and a heterogeneous Swin Transformer (HST) to fuse patch information from multiple receptive fields, and whose decoder combines two-stage multimodal bilinear pooling (MBP) with soft channel attention (SCA). The model is trained with deep supervision and a composite loss made of weighted IoU, weighted BCE, and Tversky terms. On polyp and skin lesion datasets, the authors report that HST-MRF outperforms state-of-the-art models, and ablations support the contribution of each module and of multi-receptive-field fusion. The approach targets accurate lesion localization and may extend to related medical imaging tasks.
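
To make the composite loss concrete, the following is a minimal pure-Python sketch of how weighted IoU, weighted BCE, and Tversky terms could be combined and summed over deeply supervised outputs. It is not the authors' implementation. The summary above does not say how the pixel weights are computed, how the three terms are balanced, or which Tversky coefficients are used, so the `weights` argument, the unweighted sum of the three terms, and the defaults `alpha = beta = 0.5` are all assumptions. All function names are hypothetical.

```python
import math

def weighted_bce(probs, targets, weights=None, eps=1e-7):
    """Binary cross-entropy over flattened pixels, normalized by total weight."""
    weights = weights or [1.0] * len(probs)  # assumption: uniform weights by default
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= w * (t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / sum(weights)

def weighted_iou(probs, targets, weights=None, eps=1e-7):
    """Soft IoU loss: 1 - weighted intersection / weighted union."""
    weights = weights or [1.0] * len(probs)
    inter = sum(w * p * t for p, t, w in zip(probs, targets, weights))
    union = sum(w * (p + t - p * t) for p, t, w in zip(probs, targets, weights))
    return 1.0 - (inter + eps) / (union + eps)

def tversky(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss: 1 - TP / (TP + alpha * FP + beta * FN), with soft counts."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def composite_loss(probs, targets, weights=None, alpha=0.5, beta=0.5):
    """Assumed combination: plain sum of the three terms."""
    return (weighted_iou(probs, targets, weights)
            + weighted_bce(probs, targets, weights)
            + tversky(probs, targets, alpha, beta))

def deep_supervision_loss(side_outputs, targets, weights=None):
    """Deep supervision: apply the same loss to every decoder output and sum."""
    return sum(composite_loss(p, targets, weights) for p in side_outputs)

# Toy example: four pixels, two decoder outputs of differing quality.
targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
poor = [0.6, 0.5, 0.4, 0.5]
print(deep_supervision_loss([good, poor], targets))
```

With `alpha = beta = 0.5` the Tversky term reduces to the Dice loss. Raising `beta` above `alpha` penalizes false negatives more heavily, which is a common reason to choose Tversky for small lesions.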

Abstract

The Transformer has been successfully used in medical image segmentation due to its excellent long-range modeling capabilities. However, building a Transformer-based model requires patch segmentation, which may disrupt the tissue structure in medical images and cause relevant information to be lost. In this study, we propose a Heterogeneous Swin Transformer with Multi-Receptive Field (HST-MRF) model, based on U-shaped networks, for medical image segmentation. Its main purpose is to reduce the structural information lost through the Transformer's patch segmentation by fusing patch information under different receptive fields. The heterogeneous Swin Transformer (HST) is the core module: it achieves the interaction of multi-receptive field patch information through heterogeneous attention and passes the result to the next stage for progressive learning. We also design a two-stage fusion module, multimodal bilinear pooling (MBP), to assist HST in further fusing multi-receptive field information and in combining low-level and high-level semantic information for accurate localization of lesion regions. In addition, we develop adaptive patch embedding (APE) and soft channel attention (SCA) modules, which retain more valuable information when acquiring patch embeddings and when filtering channel features, respectively, thereby improving segmentation quality. We evaluate HST-MRF on multiple datasets for polyp and skin lesion segmentation tasks. Experimental results show that the proposed method outperforms state-of-the-art models. Furthermore, ablation experiments verify the effectiveness of each module and the benefit of multi-receptive field segmentation in reducing the loss of structural information.

Paper Structure (24 sections, 13 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Our proposed heterogeneous Swin Transformer with multiple receptive fields (HST-MRF). We utilized APE and HST modules in the encoding process, and five MBP modules (${\rm MBP}_1$ to ${\rm MBP}_5$), five SCA modules (${\rm SCA}_1$ to ${\rm SCA}_5$) in the decoding process.
  • Figure 2: Description of heterogeneous attention computation in HST.
  • Figure 3: The operation process of ${\rm MBP}_1$. First, we perform Hadamard product between $X^1_5$ and $X^2_5$, and then obtain the output $D_1$ of ${\rm MBP}_1$ through Convs.
  • Figure 4: The operation process of ${\rm MBP}_t$ ($t>1$). First, we obtain the overall low-level semantic information $X^1_{6-t}\circ X^2_{6-t}$ and upsample the previous output $Y_{t-1}$ to get ${\rm Up}(Y_{t-1})$. Then, we concatenate the two and apply Convs to get the output $D_t$ of ${\rm MBP}_t$.
  • Figure 5: The operation process of ${\rm SCA}_t$. Based on the feature map $D_t$, the attention weight $\beta_t$ is obtained for each channel, and then the output $Y_t$ of the ${\rm SCA}_t$ is obtained through residual connections.
  • ...and 1 more figures
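
Read together, the captions for Figures 3 to 5 suggest the following decoder recurrence. This is a sketch reconstructed from the captions alone, not equations copied from the paper. ${\rm Concat}$ is a name introduced here for the concatenation step. The exact residual form of ${\rm SCA}_t$ is not given in the captions, so the last line, which adds $D_t$ to a channel-wise rescaling of itself by $\beta_t$, is an assumption.

```latex
\begin{align}
D_1 &= {\rm Convs}\!\left(X^1_5 \circ X^2_5\right), \\
D_t &= {\rm Convs}\!\left({\rm Concat}\!\left(X^1_{6-t} \circ X^2_{6-t},\ {\rm Up}(Y_{t-1})\right)\right), \quad t > 1, \\
Y_t &= D_t + \beta_t \odot D_t \quad \text{(assumed residual form)}, \quad t = 1, \dots, 5.
\end{align}
```

Here $\circ$ is the Hadamard product of the two receptive-field feature maps at the same encoder stage, and $\odot$ denotes scaling each channel of $D_t$ by its attention weight in $\beta_t$.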