FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention

Jun Zeng, KC Santosh, Deepak Rajan Nayak, Thomas de Lange, Jonas Varkey, Tyler Berzin, Debesh Jha

TL;DR

This work tackles robust polyp segmentation across multiple imaging modalities and clinical centers. It introduces FocusNet, a Transformer-enhanced architecture built on a PVTv2 backbone and equipped with three novel modules: CIDM for cross-semantic feature fusion, DEM for detail refinement via deformable convolutions, and FAM for balancing local detail with global context through dual-path attention. On the PolypDB dataset, FocusNet achieves state-of-the-art mIoU and mDSC scores across five modalities and three centers, supported by ablation and qualitative analyses. The approach could benefit real-world colorectal cancer (CRC) screening by enabling accurate multi-modal polyp segmentation, and it lays groundwork for real-time, multi-modal CDSS applications.

Abstract

Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM) for balancing local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced multi-modality, multi-center dataset for building more reliable segmentation methods. Extensive experiments show that FocusNet consistently outperforms existing state-of-the-art approaches, with Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI, and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at https://github.com/JunZengz/FocusNet.
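The Dice coefficient and IoU reported above can be illustrated with a minimal NumPy sketch. It assumes binary masks, a small smoothing term `eps`, and per-image averaging for the mean scores (mDSC, mIoU); none of these choices is stated in this summary, and the function names `dice_and_iou` and `mean_scores` are hypothetical, not taken from the FocusNet repository.

```python
import numpy as np


def dice_and_iou(pred, target, eps=1e-8):
    """Return (Dice, IoU) for two binary masks of the same shape.

    eps is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    intersection = np.logical_and(pred, target).sum()
    pred_area = pred.sum()
    target_area = target.sum()
    union = pred_area + target_area - intersection

    dice = (2.0 * intersection + eps) / (pred_area + target_area + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)


def mean_scores(preds, targets):
    """Average Dice and IoU over image pairs (assumed per-image averaging)."""
    scores = [dice_and_iou(p, t) for p, t in zip(preds, targets)]
    dices, ious = zip(*scores)
    return float(np.mean(dices)), float(np.mean(ious))
```

For a single image, the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same prediction.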

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the proposed FocusNet architecture. The network comprises four main components: (1) a PVTv2 (wang2022pvt) backbone that extracts multi-scale hierarchical features from the input image; (2) a Detail Enhancement Module (DEM) that refines shallow features ($f_1$) to preserve fine-grained boundary information; (3) two Cross-semantic Interaction Decoder Modules (CIDM-M and CIDM-A) that process deep features ($f_2$, $f_3$, $f_4$) to generate coarse segmentation maps by enhancing semantic fusion across multiple levels; and (4) a Focus Attention Module (FAM) that combines outputs from DEM and CIDMs using both local and pooling attention to capture detailed spatial features and global context. The FAM produces segmentation maps ($P_3$, $P_4$), which are fused with coarse predictions ($P_1$, $P_2$) to generate the final segmentation output ($\hat{P}$) for accurate and robust polyp segmentation.
  • Figure 2: Samples from various modalities in the PolypDB dataset are shown in each row, with the corresponding modality descriptions provided on the left of each row.
  • Figure 3: Qualitative results of different methods across BLI and NBI modalities. FocusNet produces better segmentation masks, particularly in challenging scenarios involving small, flat, or multiple polyps. Models such as RMAMamba-T under-segment the polyps, whereas our model provides accurate and robust segmentation.
  • Figure 4: Qualitative results of different methods across FICE, LCI, and WLI modalities. Most of the advanced models (PVT-Cascade and RMAMamba-T) can detect the polyps; however, FocusNet produces segmentation masks with higher boundary precision and consistency.
  • Figure 5: Visualization of intermediate feature maps from our model. Columns from left to right: input image, ground truth mask (GT), predicted mask (Ours), encoder feature maps ($f_1$, $f_2$), detail-enhanced transformer map ($T'$), and decoder outputs ($\hat{F}_1$, $\hat{F}_2$). Here, the shallow feature $f_1$ preserves fine-grained details, while $f_2$ captures higher-level semantic context. The enhanced map $T'$ refines boundary structures, and the decoder outputs integrate local and global cues for precise segmentation. The progression illustrates how FocusNet effectively combines detail, context, and attention to delineate polyps.
  • ...and 1 more figure
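The Figure 1 caption describes the FAM as combining local attention (fine spatial detail) with pooling attention (global context). The sketch below illustrates that general idea in NumPy and is not the authors' implementation. It assumes non-overlapping windows for the local path, average-pooled keys and values for the pooling path, no learned projections, and a plain sum to merge the two paths. All function names and the parameters `w` and `p` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; q is (..., Nq, C), k and v are (..., Nk, C)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)
    return weights @ v


def avg_pool(x, p):
    """Average-pool an (H, W, C) map with a p x p window; H and W must be divisible by p."""
    H, W, C = x.shape
    return x.reshape(H // p, p, W // p, p, C).mean(axis=(1, 3))


def pooling_attention(x, p):
    """Every position attends to a pooled, coarser copy of the map (global context)."""
    H, W, C = x.shape
    q = x.reshape(H * W, C)
    kv = avg_pool(x, p).reshape(-1, C)
    return attention(q, kv, kv).reshape(H, W, C)


def local_attention(x, w):
    """Every position attends only within its own w x w window (local detail)."""
    H, W, C = x.shape
    # Group pixels into non-overlapping windows: (H//w, W//w, w*w, C).
    win = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    win = win.reshape(H // w, W // w, w * w, C)
    out = attention(win, win, win)
    # Undo the window grouping to recover the (H, W, C) layout.
    out = out.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)


def dual_path_attention(x, w=2, p=2):
    """Merge the two paths by summation (an assumption made for illustration)."""
    return local_attention(x, w) + pooling_attention(x, p)
```

Pooling the keys and values by a factor `p` shrinks the attention matrix from (HW) x (HW) to (HW) x (HW / p²), which is the usual motivation for this kind of design on high-resolution feature maps.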