PSTNet: Enhanced Polyp Segmentation with Multi-scale Alignment and Frequency Domain Integration

Wenhao Xu, Rongtao Xu, Changwei Wang, Xiuli Li, Shibiao Xu, Li Guo

TL;DR

PSTNet tackles the challenge of accurate colorectal polyp segmentation in colonoscopy images by integrating frequency-domain cues with RGB features. It introduces three modules—FCAM for frequency cues, FSAM for multi-scale feature alignment, and CPM for cross-perception fusion—built on a four-scale shunted Transformer encoder. The approach demonstrates state-of-the-art performance across five challenging datasets, with ablations confirming the contribution of each module and the loss design. This work advances computer-assisted CRC diagnosis by improving polyp localization, boundary precision, and generalization, while offering potential applicability to other medical imaging tasks.

Abstract

Accurate segmentation of colorectal polyps in colonoscopy images is crucial for effective diagnosis and management of colorectal cancer (CRC). However, current deep learning-based methods primarily rely on fusing RGB information across multiple scales, leading to limitations in accurately identifying polyps due to restricted RGB domain information and challenges in feature misalignment during multi-scale aggregation. To address these limitations, we propose the Polyp Segmentation Network with Shunted Transformer (PSTNet), a novel approach that integrates both RGB and frequency domain cues present in the images. PSTNet comprises three key modules: the Frequency Characterization Attention Module (FCAM) for extracting frequency cues and capturing polyp characteristics, the Feature Supplementary Alignment Module (FSAM) for aligning semantic information and reducing misalignment noise, and the Cross Perception localization Module (CPM) for synergizing frequency cues with high-level semantics to achieve efficient polyp segmentation. Extensive experiments on challenging datasets demonstrate PSTNet's significant improvement in polyp segmentation accuracy across various metrics, consistently outperforming state-of-the-art methods. The integration of frequency domain cues and the novel architectural design of PSTNet contribute to advancing computer-assisted polyp segmentation, facilitating more accurate diagnosis and management of CRC.

Paper Structure

This paper contains 30 sections, 13 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Comparison of our proposed PSTNet with MSNet [zhao2021automatic] on a diverse set of challenging polyp images. These include cases where the polyps are diminutive and easily overlooked (a), and cases where the segmentation boundaries are prone to errors due to blurred demarcations (b) and (c). The results show that PSTNet outperforms MSNet in polyp localization and achieves higher segmentation accuracy.
  • Figure 2: The framework of our PSTNet, which comprises (a) the shunted Transformer (ST) [ren2022shunted] as the encoder network, (b) the Feature Supplementary Alignment Module (FSAM), which contains three Feature Alignment (FA) units, for fusing global semantic features, (c) the Frequency Characterization Attention Module (FCAM) for extracting low-level semantic features with frequency-domain cues, and (d) the Cross Perception localization Module (CPM) for linking frequency-domain cues with global semantic features to produce the final output.
  • Figure 3: The details of the Frequency Characterization Attention Module (FCAM). First, the input ${X}_{in}$ is cut, repeated, and fused horizontally and vertically so that each spatial location obtains a feature response from a global context with the same horizontal and vertical coordinates. Second, a 2D discrete cosine transform (${DCT}_{2D}$) is applied to obtain spectral information. Finally, the resulting full attention affinity is used to re-weight each channel map.
  • Figure 4: The details of the Feature Alignment units. First, the high-level features $\mathbf{C}$ are upsampled and connected to the neighbouring low-level features $\mathbf{P}$. The two predicted biased feature maps are then obtained by deformable convolution in BSD, and the features are aligned separately for both scales, followed by a summation operation.
  • Figure 5: The details of the CP unit. The low-level features $\mathbf{R}_{1}$, which contain frequency-domain information, are aligned with the global features $\mathbf{R}_{2}$ via the FA unit, then enhanced via the FCA unit, and finally added to the absolute difference $|\mathbf{R}_{1}-\mathbf{R}_{2}|$.
  • ...and 3 more figures
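
The last two steps in the Figure 3 caption (a 2D DCT to obtain spectral information, then re-weighting each channel map) can be illustrated with a small sketch. This is not the authors' implementation: the function names `dct_2d`, `channel_weight`, and `reweight_channels` are hypothetical, and the descriptor used here (a sigmoid of the mean absolute DCT coefficient) is an assumption standing in for the learned attention affinity. The cut, repeat, and fuse step is not reproduced.

```python
import math


def dct_2d(x):
    """Orthonormal 2D DCT-II of one H x W channel map (list of lists)."""
    h, w = len(x), len(x[0])

    def scale(k, n):
        # Normalization factor that makes the transform orthonormal.
        return math.sqrt((1.0 if k == 0 else 2.0) / n)

    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0.0
            for i in range(h):
                for j in range(w):
                    total += (
                        x[i][j]
                        * math.cos(math.pi * (2 * i + 1) * u / (2 * h))
                        * math.cos(math.pi * (2 * j + 1) * v / (2 * w))
                    )
            out[u][v] = scale(u, h) * scale(v, w) * total
    return out


def channel_weight(channel):
    """Assumed descriptor: sigmoid of the mean absolute DCT coefficient."""
    spectrum = dct_2d(channel)
    coeffs = [abs(c) for row in spectrum for c in row]
    mean_energy = sum(coeffs) / len(coeffs)
    return 1.0 / (1.0 + math.exp(-mean_energy))


def reweight_channels(feature):
    """Scale every channel map of a C x H x W feature by its own weight."""
    weights = [channel_weight(ch) for ch in feature]
    scaled = [
        [[wgt * val for val in row] for row in ch]
        for ch, wgt in zip(feature, weights)
    ]
    return scaled, weights


if __name__ == "__main__":
    flat = [[1.0, 1.0], [1.0, 1.0]]        # no spatial variation
    checker = [[1.0, -1.0], [-1.0, 1.0]]   # highest-frequency 2 x 2 pattern
    _, weights = reweight_channels([flat, checker])
    print(weights)
```

In this toy setting the flat map puts all of its energy in the lowest-frequency coefficient and the checkerboard puts it in the highest, which is the kind of spectral distinction a frequency-based attention module can exploit. A practical version would likely use a tensor library and learned layers rather than nested Python lists.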