PMFSNet: Polarized Multi-scale Feature Self-attention Network For Lightweight Medical Image Segmentation

Jiahui Zhong, Wenhong Tian, Yuanlun Xie, Zhijia Liu, Jie Ou, Taoran Tian, Lei Zhang

TL;DR

PMFSNet introduces a lightweight, UNet-based segmentation framework that prioritizes efficiency without sacrificing accuracy by placing a Polarized Multi-scale Feature Self-attention (PMFS) block at the bottleneck. The PMFS block combines Adaptive Multi-branch Feature Fusion with Polarized Multi-scale Channel and Spatial Self-attention modules to expand the number of attention points and capture global context across scales, while using depthwise separable convolutions to reduce complexity. Across 3D CBCT Tooth, MMOTU ultrasound, and ISIC 2018 dermoscopy datasets, PMFSNet achieves competitive IoU scores with far fewer parameters (under 1 million) and lower FLOPs than many state-of-the-art transformers and CNNs, enabling deployment on edge devices. Ablation studies confirm the PMFS block’s effectiveness and its plug-and-play potential for other UNet-based architectures, underscoring a practical pathway toward efficient, scalable medical image segmentation.
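To give a rough sense of why depthwise separable convolutions cut the parameter count, the sketch below compares weight counts for a standard 3D convolution and a depthwise-plus-pointwise pair. The 144-channel width is borrowed from the Figure 5 caption; the kernel size of 3 and the function names are assumptions made for illustration, not values taken from the PMFSNet code.

```python
def conv3d_params(c_in, c_out, k, bias=False):
    """Weight count of a standard 3D convolution with a k*k*k kernel."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def depthwise_separable_conv3d_params(c_in, c_out, k, bias=False):
    """Weight count of a depthwise k*k*k conv followed by a 1*1*1 pointwise conv."""
    depthwise = c_in * k ** 3 + (c_in if bias else 0)   # one k*k*k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)   # 1*1*1 conv that mixes channels
    return depthwise + pointwise


# Assumed setting: 144 -> 144 channels, kernel size 3 (hypothetical, for illustration).
standard = conv3d_params(144, 144, 3)
separable = depthwise_separable_conv3d_params(144, 144, 3)
print(standard, separable, round(standard / separable, 1))
```

Under these assumed sizes the separable variant needs roughly 23 times fewer weights, which is the kind of saving the PMFS block relies on.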

Abstract

Current state-of-the-art medical image segmentation methods prioritize accuracy, often at the expense of higher computational demands and larger model sizes. Applying such large models to relatively small medical image datasets tends to introduce redundant computation, adding complexity without a corresponding benefit. It also makes integration and deployment on edge devices difficult. For instance, recent transformer-based models have excelled in 2D and 3D medical image segmentation due to their extensive receptive fields and high parameter counts. However, they risk overfitting on small datasets and often neglect the inductive biases of Convolutional Neural Networks (CNNs), which are essential for local feature representation. In this work, we propose PMFSNet, a novel medical image segmentation model that balances global and local feature processing while avoiding the computational redundancy typical of larger models. PMFSNet streamlines the UNet-based hierarchical structure and reduces the computational complexity of the self-attention mechanism, making it suitable for lightweight applications. It incorporates a plug-and-play PMFS block, a multi-scale feature enhancement module based on attention mechanisms, to capture long-term dependencies. Extensive experimental results demonstrate that, even with fewer than 1 million parameters, our method achieves superior performance in various segmentation tasks across different data scales. It achieves Intersection over Union (IoU) scores of 84.68%, 82.02%, and 78.82% on public datasets of teeth CT (CBCT), ovarian tumor ultrasound (MMOTU), and skin lesion dermoscopy images (ISIC 2018), respectively. The source code is available at https://github.com/yykzjh/PMFSNet.
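For readers unfamiliar with the IoU metric quoted above, here is a minimal sketch of how it is usually computed for binary masks. The function name, the flattened-list input, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the authors' evaluation code may differ.

```python
def iou(pred, target):
    """Intersection over Union for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy example: the masks overlap in one voxel, and three voxels are foreground in either mask.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(iou(pred, target))
```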

Paper Structure (28 sections, 15 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Params-FLOPs-IoU comparison on the 3D CBCT tooth dataset. The Y-axis shows the Intersection over Union (IoU) (higher is better), the X-axis shows the number of floating-point operations (FLOPs) (lower is better), and the size of each circle shows the parameter count (Params) (smaller is better). PMFSNet (ours) achieves the best balance among segmentation performance, parameter count, and computational complexity.
  • Figure 2: The samples from different datasets. Sub-figure (a) shows some challenging samples of the 3D CBCT tooth dataset, such as missing teeth, metal artifacts, and incomplete views. Sub-figure (b) shows some challenging samples of the MMOTU dataset, such as inconspicuous lesion regions, blurred lesion boundaries, and low-contrast samples. Sub-figure (c) shows some challenging samples of the ISIC 2018 dataset, such as blurred samples, irregular lesion boundaries, and occluded lesions.
  • Figure 3: Overview of Polarized Multi-scale Feature Self-attention Network (PMFSNet) architecture. The input medical images are fed into an encoder with 3 stages. Then, the PMFS block enhances the features of the network's bottleneck using features at different scales. Finally, the skip connections are fused with an optional decoder, which sequentially incorporates the global contextual features into the enhanced bottleneck features by CNN-based up-sampling, gradually restoring them to the same resolution as the input image.
  • Figure 4: The Adaptive Multi-branch Feature Fusion (AMFF) layer. In one example, the shapes of $X_1, X_2, X_3$ are $36\times80\times80\times48$, $64\times40\times40\times24$, and $104\times20\times20\times12$, respectively. Downsampling and channel scaling unify each branch to $48\times20\times20\times12$. The fused feature $A$, obtained by concatenating the three branches along the channel dimension, has shape $144\times20\times20\times12$.
  • Figure 5: The Polarized Multi-scale Channel Self-attention (PMCS) module. In one example, the input feature map has shape $C \times H \times W \times D$ ($144 \times 20 \times 20 \times 12$), where the channel dimension $C = 48 + 48 + 48$ is the concatenation of three branches, each unified to 48 channels. Depthwise separable convolution blocks are used to further reduce computational complexity.
  • ...and 5 more figures
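The shape arithmetic in the Figure 4 caption can be traced with a small sketch. It only follows tensor shapes in (C, H, W, D) order; the helper name is hypothetical, and the caption does not say which operators perform the downsampling and channel scaling, so none are modeled here.

```python
def amff_shapes(branches, target_channels=48):
    """Trace AMFF shapes: unify every branch to the last branch's spatial size, then concatenate."""
    target_spatial = branches[-1][1:]  # the deepest branch sets the target (H, W, D)
    factors, unified = [], []
    for c, h, w, d in branches:
        # Per-axis downsampling factor needed to reach the target spatial size.
        per_axis = tuple(s // t for s, t in zip((h, w, d), target_spatial))
        if any(s % t for s, t in zip((h, w, d), target_spatial)):
            raise ValueError("spatial size is not an integer multiple of the target")
        factors.append(per_axis)
        unified.append((target_channels, *target_spatial))  # channels scaled to 48
    fused = (target_channels * len(branches), *target_spatial)  # channel-wise concatenation
    return factors, unified, fused


# Shapes from the Figure 4 caption.
x1, x2, x3 = (36, 80, 80, 48), (64, 40, 40, 24), (104, 20, 20, 12)
factors, unified, fused = amff_shapes([x1, x2, x3])
print(factors)  # downsampling factor per branch and axis
print(fused)    # shape of the fused feature A
```

With the caption's inputs, the three branches need downsampling by 4, 2, and 1 per axis, and the fused feature has 3 × 48 = 144 channels at 20 × 20 × 12.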