HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer

Jingjing Ren, Xiaoyong Zhang, Lina Zhang

TL;DR

HiFiSeg is a network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. It uses the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM).

Abstract

Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)-based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high-frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High-frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC-ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.
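
The mDice scores quoted above are averages of a per-image Dice coefficient, Dice = 2|P ∩ G| / (|P| + |G|), where P is the predicted mask and G the ground-truth mask. The following is a minimal sketch of that computation on binary masks stored as nested lists. The function names `dice` and `mean_dice` are illustrative, and the convention of scoring two empty masks as 1.0 is an assumption; the paper's exact evaluation protocol (thresholding, resizing) is not given in this section.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 pixel labels


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|P ∩ G| / (|P| + |G|) for two equal-shape binary masks."""
    # Count pixels that are foreground in both masks.
    intersection = sum(
        p & g
        for pred_row, target_row in zip(pred, target)
        for p, g in zip(pred_row, target_row)
    )
    # Total foreground pixels in prediction plus ground truth.
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-image Dice scores over a dataset (mDice)."""
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example: the first prediction over-segments by one pixel, the second is exact.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_dice(preds, targets)  # (2/3 + 1) / 2 = 5/6
```

Because the score is a ratio of overlap to total foreground area, a one-pixel error costs far more on a small polyp than on a large one, which is one reason boundary detail matters for this metric.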

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Segmentation examples from the PraNet model on different challenging cases.
  • Figure 2: The overall architecture of the HiFiSeg network. The model contains three components: (a) a pyramid vision transformer (PVT) as the encoder; (b) a pyramid global-local interaction module (GLIM) for fusing multi-level features; (c) a selective aggregation module (SAM) for selectively integrating high- and low-level features for the final output.
  • Figure 3: Details of the global-local interaction module (GLIM). It consists of three convolutional branches and a global average pooling branch.
  • Figure 4: Qualitative results comparison of different models.
  • Figure 5: Visualization of the ablation study results.