Multi-dimension Transformer with Attention-based Filtering for Medical Image Segmentation

Wentao Wang, Xi Xiao, Mingjie Liu, Qing Tian, Xuanyao Huang, Qizhen Lan, Swalpa Kumar Roy, Tianyang Wang

TL;DR

This work targets accurate medical image segmentation by addressing two core challenges: low signal-to-noise ratio and limited cross-dimension feature representation in transformers. It introduces MDT-AF, a multi-dimension transformer with an attention-based filtering patch embedding that implements a coarse-to-fine refinement, plus blocks that extend self-attention across spatial and channel dimensions with an interaction mechanism. The approach achieves state-of-the-art results across Lung X-ray, Skin Lesion, and Kvasir-SEG benchmarks, demonstrating improved boundary delineation and robustness to noise. The combination of refined patch embeddings and multi-dimension attention offers a practical pathway to robust, high-precision medical image segmentation across diverse imaging modalities.

Abstract

The accurate segmentation of medical images is crucial for diagnosing and treating diseases. Recent studies demonstrate that vision transformer-based methods have significantly improved performance in medical image segmentation, primarily due to their superior ability to establish global relationships among features and adaptability to various inputs. However, these methods struggle with the low signal-to-noise ratio inherent to medical images. Additionally, the effective utilization of channel and spatial information, which are essential for medical image segmentation, is limited by the representation capacity of self-attention. To address these challenges, we propose a multi-dimension transformer with attention-based filtering (MDT-AF), which redesigns the patch embedding and self-attention mechanism for medical image segmentation. MDT-AF incorporates an attention-based feature filtering mechanism into the patch embedding blocks and employs a coarse-to-fine process to mitigate the impact of low signal-to-noise ratio. To better capture complex structures in medical images, MDT-AF extends the self-attention mechanism to incorporate spatial and channel dimensions, enriching feature representation. Moreover, we introduce an interaction mechanism to improve the feature aggregation between spatial and channel dimensions. Experimental results on three public medical image segmentation benchmarks show that MDT-AF achieves state-of-the-art (SOTA) performance.
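To make the difference between the two attention dimensions concrete, the following is a minimal sketch, not the authors' implementation. It assumes identity projections (queries, keys, and values are all the input itself) and plain scaled dot-product attention; the function names are hypothetical. Spatial attention relates the N token positions to each other (an N x N weight matrix), while channel attention applies the same operation to the transposed features and relates the C channels (a C x C weight matrix). How MDT-AF combines the two outputs through its interaction mechanism is not specified in the abstract, so that step is left out.

```python
import math

def matmul(a, b):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max for numerical stability.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # ASSUMPTION: identity projections, i.e. Q = K = V = x.
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, transpose(x))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x), weights

def spatial_attention(x):
    # x is (tokens x channels); weights relate token positions.
    return self_attention(x)

def channel_attention(x):
    # Attend over the transposed features so weights relate channels,
    # then transpose back to (tokens x channels).
    out_t, weights = self_attention(transpose(x))
    return transpose(out_t), weights

# Toy input: 4 tokens with 3 channels each (illustrative values only).
tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
s_out, s_w = spatial_attention(tokens)
c_out, c_w = channel_attention(tokens)
print(len(s_w), len(s_w[0]))  # shape of the spatial weight matrix
print(len(c_w), len(c_w[0]))  # shape of the channel weight matrix
```

Both branches return features with the same shape as the input, which is what makes a later aggregation step between them possible.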

Paper Structure

This paper contains 14 sections, 16 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The proposed MDT-AF framework comprises three primary modules: 1) Attention-based patch embedding blocks generate patch tokens while concurrently producing attention weights. These weights are instrumental in filtering out coarse features and noise. 2) Multi-dimension transformer blocks extend self-attention across spatial and channel-wise dimensions to build and aggregate a comprehensive feature representation. 3) MLP decoders fuse these multi-level features to accurately predict the semantic segmentation mask. The channel widths are set to $C_{1} = 64$, $C_{2} = 128$, $C_{3} = 320$, and $C_{4} = 512$.
  • Figure 2: The proposed Patch Embedding with Attention-based Filtering consists of two parallel branches: 1) The Overlap Patch Embedding branch processes an input feature map to extract coarse features. 2) Simultaneously, a parallel branch generates corresponding attention weights. These weights are applied to the coarse features, filtering out noise and refining the feature representation from coarse to fine. Notably, this approach is consistently employed in the patch embedding block of each encoder stage to generate patch tokens.
  • Figure 3: The proposed Multi-dimension Transformer Block (b) expands self-attention to the spatial and channel dimensions. Unlike the transformer block (a), it incorporates feature interaction and aggregation within blocks: spatial self-attention captures contextual information across image positions, and channel self-attention analyzes relationships among feature channels to highlight significant features. A convolution branch runs in parallel with self-attention to add locality. "SA" denotes self-attention, "ESA" signifies efficient self-attention, "SSA" stands for spatial self-attention, and "CSA" represents channel self-attention.
  • Figure 4: The visual comparison results are presented for the Lung X-ray, Skin Lesion, and Kvasir-SEG datasets. The images in the first row depict the results of skin lesion segmentation, the second row shows the outcomes of polyp segmentation, and the final row displays the results of lung segmentation.
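
The coarse-to-fine filtering described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' code: the caption only says that attention weights from a parallel branch are applied to the coarse features, so the sigmoid gate, the element-wise product, and the names `filter_features`, `coarse`, and `logits` are all assumptions made for illustration. In a real model, both inputs would come from learned convolutional branches.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_features(coarse, logits):
    """Gate coarse patch features with attention weights in (0, 1).

    coarse: (tokens x channels) output of the patch embedding branch.
    logits: same shape, raw output of the parallel attention branch.
    ASSUMPTION: weights = sigmoid(logits), applied element-wise.
    """
    same_rows = len(coarse) == len(logits)
    if not same_rows or any(len(c) != len(l) for c, l in zip(coarse, logits)):
        raise ValueError("coarse and logits must have the same shape")
    return [
        [c * sigmoid(l) for c, l in zip(c_row, l_row)]
        for c_row, l_row in zip(coarse, logits)
    ]

# Toy values: 2 tokens with 2 channels each.
coarse = [[1.0, -2.0], [0.5, 4.0]]
# A large positive logit keeps a feature, a large negative one suppresses it,
# and a zero logit halves it.
logits = [[0.0, 10.0], [-10.0, 0.0]]
refined = filter_features(coarse, logits)
print(refined)
```

Under this assumption the gate never changes a feature's sign or increases its magnitude; it only scales each value toward zero, which is one simple way to read "filtering out noise" from the coarse features.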