Transformer-Enhanced Iterative Feedback Mechanism for Polyp Segmentation

Nikhil Kumar Tomar, Debesh Jha, Koushik Biswas, Tyler M. Berzin, Rajesh Keswani, Michael Wallace, Ulas Bagci

TL;DR

This work tackles automated polyp segmentation in colonoscopy images, a task complicated by variable polyp appearance and motivated by high endoscopist miss rates. It introduces FANetv2, an encoder–decoder architecture that combines a Pyramid Vision Transformer backbone, a Feature Enhancement Dilated block, iterative feedback attention, and text-guided attention to refine segmentation from an initial Otsu mask while also performing auxiliary polyp attribute classification. The model achieves state-of-the-art performance on BKAI-IGH and CVC-ClinicDB, with DSC values of 0.9186 and 0.9481, respectively, and low Hausdorff distances, outperforming multiple transformer-based baselines. The results suggest that FANetv2's iterative refinement and contextual text cues can robustly handle polyps of varying sizes and counts across imaging modalities, indicating potential for real-time clinical support in CRC screening.

Abstract

Colorectal cancer (CRC) is the third most commonly diagnosed cancer in the United States and the second leading cause of cancer-related death among both genders. Notably, CRC is the leading cause of cancer in men younger than 50. Colonoscopy is considered the gold standard for the early diagnosis of CRC. However, skills vary significantly among endoscopists, and a high miss rate is reported. Automated polyp segmentation can reduce the miss rate and enable timely treatment at an early stage. To address this challenge, we introduce FANetv2, an advanced encoder-decoder network designed to accurately segment polyps from colonoscopy images. Starting from an initial input mask generated by Otsu thresholding, FANetv2 iteratively refines its binary segmentation masks through a novel feedback attention mechanism informed by the mask predictions of previous epochs. Additionally, it employs a text-guided approach that integrates information about the number (one or many) and size (small, medium, large) of polyps to further enhance its feature representation. This dual-task approach supports accurate polyp segmentation and auxiliary classification of polyp attributes, significantly boosting the model's performance. Our comprehensive evaluations on the publicly available BKAI-IGH and CVC-ClinicDB datasets demonstrate the superior performance of FANetv2, evidenced by high Dice similarity coefficients (DSC) of 0.9186 and 0.9481, along with low Hausdorff distances of 2.83 and 3.19, respectively. The source code for FANetv2 is available at https://github.com/xxxxx/FANetv2.
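
As a rough illustration of two ingredients named in the abstract, the sketch below computes an Otsu threshold to build an initial binary mask and evaluates a Dice similarity coefficient between two masks. It is a minimal NumPy sketch, not the authors' code: the function names (`otsu_threshold`, `initial_otsu_mask`, `dice_coefficient`), the 8-bit grayscale input, and the choice to treat pixels above the threshold as foreground are all assumptions made for illustration.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold of an 8-bit grayscale image (dtype uint8)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # probability of class "<= t"
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean up to t
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t.
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                          # skip thresholds with an empty class
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def initial_otsu_mask(gray: np.ndarray) -> np.ndarray:
    """Binary mask with 1 where the pixel is brighter than the Otsu threshold.

    Treating the brighter class as foreground is an assumption of this sketch.
    """
    return (gray > otsu_threshold(gray)).astype(np.uint8)


def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```

In practice the Otsu mask is only a coarse starting point; the abstract describes it as the input that the feedback mechanism subsequently refines.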

Paper Structure (16 sections, 1 equation, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overview of the FANetv2 architecture. FANetv2 takes two inputs: a polyp image and an initial input mask generated by Otsu thresholding. It performs two tasks: auxiliary polyp attribute classification and polyp segmentation. The input image is first fed to the encoder, whose output passes to the Feature Enhancement Block; the enhanced features are used both for polyp attribute classification and by the rest of the network for polyp segmentation. FANetv2 combines the input mask and the polyp attributes into a unified feature representation, which is passed to the decoder to predict the final segmentation mask. Two mechanisms drive this process: a feedback attention mechanism, which uses the mask from the previous epoch to guide refinement of the segmentation, and a text-guided mechanism, which incorporates information about the number and size of polyps in the image. Together, these components enhance the feature representation and improve the overall performance of FANetv2.
  • Figure 2: Qualitative comparison of results on BKAI-IGH [lan2021neounet] and CVC-ClinicDB [bernal2015wm]. The heatmaps show how precisely FANetv2 predicts polyps of various sizes and shapes.
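
The feedback idea described in the Figure 1 caption, where the mask from the previous epoch is fed back as an input for the next one, can be sketched as a per-image mask bank that is overwritten after every pass. The code below is a toy illustration under stated assumptions, not the FANetv2 implementation: `toy_model` is a hypothetical stand-in for the network (it simply gates the image with the previous mask), `run_feedback` and `mask_bank` are illustrative names, and the 0.5 binarization threshold is assumed.

```python
import numpy as np


def toy_model(image: np.ndarray, prev_mask: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the network.

    Attenuates (but does not erase) regions outside the previous mask and
    returns a "probability" map in [0, 1], assuming image values are in [0, 1].
    """
    return image * (0.5 + 0.5 * prev_mask)


def run_feedback(images, init_masks, epochs: int = 3, threshold: float = 0.5):
    """Refine one stored mask per image over several epochs."""
    # Epoch 0 masks come from outside the loop (e.g. Otsu thresholding).
    mask_bank = {i: m.astype(np.float32).copy() for i, m in enumerate(init_masks)}
    for _ in range(epochs):
        for i, image in enumerate(images):
            prob = toy_model(image, mask_bank[i])
            # The binarized prediction replaces the stored mask, so the next
            # epoch sees this epoch's output as its input mask.
            mask_bank[i] = (prob > threshold).astype(np.float32)
        # A real training loop would also update the network weights here.
    return mask_bank
```

A usage example: with `image = np.array([[0.9, 0.8], [0.2, 0.1]])` and an all-ones initial mask, `run_feedback([image], [np.ones((2, 2))])[0]` keeps only the top row as foreground. The point of the sketch is the data flow (stored mask in, refined mask out), not the toy gating rule.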