FocusNet: Transformer-enhanced Polyp Segmentation with Local and Pooling Attention
Jun Zeng, KC Santosh, Deepak Rajan Nayak, Thomas de Lange, Jonas Varkey, Tyler Berzin, Debesh Jha
TL;DR
This work tackles the challenge of robust polyp segmentation across multiple imaging modalities and clinical centers. It introduces FocusNet, a Transformer-enhanced architecture built on a PVTv2 backbone and equipped with three novel modules: CIDM for cross-semantic feature fusion, DEM for detail refinement via deformable convolutions, and FAM for balancing local detail with global context through dual-path attention. On the PolypDB dataset, FocusNet achieves state-of-the-art performance across five modalities and three centers, as measured by mIoU and mDSC, and the paper supports these results with ablation and qualitative analyses. By enabling accurate multi-modal polyp segmentation, the approach is relevant to real-world colorectal cancer (CRC) screening and lays groundwork for real-time, multi-modal clinical decision support system (CDSS) applications.
Abstract
Colonoscopy is vital in the early diagnosis of colorectal polyps. Regular screenings can effectively prevent benign polyps from progressing to colorectal cancer (CRC). While deep learning has made impressive strides in polyp segmentation, most existing models are trained on single-modality and single-center data, making them less effective in real-world clinical environments. To overcome these limitations, we propose FocusNet, a Transformer-enhanced focus attention network designed to improve polyp segmentation. FocusNet incorporates three essential modules: the Cross-semantic Interaction Decoder Module (CIDM) for generating coarse segmentation maps, the Detail Enhancement Module (DEM) for refining shallow features, and the Focus Attention Module (FAM) for balancing local detail and global context through local and pooling attention mechanisms. We evaluate our model on PolypDB, a newly introduced dataset with multi-modality and multi-center data for building more reliable segmentation methods. Extensive experiments show that FocusNet consistently outperforms existing state-of-the-art approaches, achieving Dice coefficients of 82.47% on the BLI modality, 88.46% on FICE, 92.04% on LCI, 82.09% on NBI, and 93.42% on WLI, demonstrating its accuracy and robustness across five different modalities. The source code for FocusNet is available at https://github.com/JunZengz/FocusNet.
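For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a Dice coefficient and an IoU score are commonly computed for a pair of binary masks, and how per-image scores can be averaged into mean values such as mDSC and mIoU. It is an illustration only and is not taken from the FocusNet repository: the function names (`dice`, `iou`, `mean_score`), the smoothing term `eps`, and the per-image averaging are assumptions, and the exact evaluation protocol used in the paper may differ.

```python
from typing import Callable, Sequence

# A binary mask as nested rows of 0/1 values (hypothetical representation).
Mask = Sequence[Sequence[int]]


def _counts(pred: Mask, target: Mask) -> tuple[int, int, int]:
    """Return (intersection, predicted positives, ground-truth positives)."""
    inter = pred_sum = target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p, t = int(bool(p)), int(bool(t))
            inter += p & t
            pred_sum += p
            target_sum += t
    return inter, pred_sum, target_sum


def dice(pred: Mask, target: Mask, eps: float = 1e-8) -> float:
    """Dice coefficient: 2|P ∩ G| / (|P| + |G|)."""
    inter, pred_sum, target_sum = _counts(pred, target)
    # eps (an assumption) avoids division by zero when both masks are empty.
    return (2 * inter + eps) / (pred_sum + target_sum + eps)


def iou(pred: Mask, target: Mask, eps: float = 1e-8) -> float:
    """Intersection over union: |P ∩ G| / |P ∪ G|."""
    inter, pred_sum, target_sum = _counts(pred, target)
    union = pred_sum + target_sum - inter
    return (inter + eps) / (union + eps)


def mean_score(metric: Callable[[Mask, Mask], float],
               preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average a per-image metric over a set of images (assumed protocol)."""
    scores = [metric(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example: the prediction covers two pixels, the ground truth one of them.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
example_dice = dice(pred_mask, true_mask)  # 2*1 / (2 + 1), about 0.667
example_iou = iou(pred_mask, true_mask)    # 1 / 2 = 0.5
```

In this toy case the Dice score is higher than the IoU, which holds in general for any partial overlap; this is worth keeping in mind when comparing the two columns of a results table.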
