
RTA-Former: Reverse Transformer Attention for Polyp Segmentation

Zhikai Li, Murong Yi, Ali Uneri, Sihan Niu, Craig Jones

TL;DR

This work tackles the challenge of precise edge delineation in polyp segmentation by pairing a PVT-based Transformer encoder with a novel Reverse Transformer Attention (RTA) module in the decoder. RTA-Former uses a Hierarchical Feature Synthesizer to fuse multi-scale transformer features and applies a reverse attention mechanism to emphasize difficult edge regions, yielding state-of-the-art results on five polyp datasets. Across multiple backbone sizes, the method shows strong learning and generalization, outperforms CNN-based approaches, and indicates practical potential for clinical decision support. The choice of backbone size also lets users trade accuracy against computational cost, and the code is publicly available for reproducibility.
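The TL;DR says the Hierarchical Feature Synthesizer fuses multi-scale transformer features but does not describe its internals (Figure 1 shows them). As a rough illustration only, the sketch below assumes a generic fusion: each stage of a four-stage pyramid (strides 4, 8, 16, 32, as in PVT) is projected to a common channel count with a 1x1 projection, upsampled to the finest resolution by nearest-neighbour repetition, and summed. `fuse_multiscale`, the channel widths, and the summation rule are hypothetical choices, not the paper's design.

```python
import numpy as np


def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution as channel mixing: (C_in, H, W) -> (C_out, H, W)."""
    return np.tensordot(weight, x, axes=([1], [0]))


def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_multiscale(stages, weights):
    """Hypothetical hierarchical fusion of encoder stages (not the paper's HFS).

    stages:  list of (C_i, H_i, W_i) maps, finest first; each stage's H and W
             must divide the finest stage's by the same integer factor.
    weights: list of (C_out, C_i) projection matrices, one per stage.
    Returns a (C_out, H_0, W_0) map: every stage is projected to C_out
    channels, upsampled to the finest resolution, and summed.
    """
    target_h = stages[0].shape[1]
    fused = None
    for feat, w in zip(stages, weights):
        projected = conv1x1(feat, w)
        upsampled = upsample_nearest(projected, target_h // feat.shape[1])
        fused = upsampled if fused is None else fused + upsampled
    return fused


# Four-stage pyramid at strides 4/8/16/32 of a hypothetical 32x32 input.
rng = np.random.default_rng(0)
channels = [64, 128, 320, 512]  # illustrative channel widths
sizes = [8, 4, 2, 1]
stages = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
weights = [rng.standard_normal((32, c)) * 0.05 for c in channels]
fused = fuse_multiscale(stages, weights)
print(fused.shape)  # (32, 8, 8)
```

Summation keeps the sketch short. Concatenation followed by a projection is an equally common choice, and the actual synthesizer may use either or neither.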

Abstract

Polyp segmentation is a key aspect of colorectal cancer prevention, enabling early detection and guiding subsequent treatment. Intelligent diagnostic tools, including deep learning solutions, are widely explored to streamline and potentially automate this process. However, even with many powerful network architectures, producing accurate edge segmentation remains a challenge. In this paper, we introduce a novel network, RTA-Former, that employs a transformer model as the encoder backbone and adapts Reverse Attention (RA) with a transformer stage in the decoder for enhanced edge segmentation. Experiments show that RTA-Former achieves state-of-the-art (SOTA) performance on five polyp segmentation datasets. The strong capability of RTA-Former holds promise for improving the accuracy of Transformer-based polyp segmentation, potentially leading to better clinical decisions and patient outcomes. Our code is publicly available on GitHub.
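The abstract states that RTA-Former adapts Reverse Attention with a transformer stage in the decoder but does not give the wiring. The sketch below is one minimal reading under stated assumptions. It applies the PraNet-style reverse weight 1 - sigmoid(P) to features, where P is a coarse prediction. It then runs single-head self-attention with a residual connection over the reverse-weighted features. The function `reverse_transformer_stage`, the (C, H, W) layout, and the random projection matrices are hypothetical. Figure 2 and the released code describe the actual RTA structure.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def reverse_transformer_stage(features, coarse_logits, w_q, w_k, w_v):
    """Hypothetical reverse-attention-plus-transformer block.

    features:      (C, H, W) decoder feature map
    coarse_logits: (H, W) logits of a coarser polyp prediction
    w_q, w_k, w_v: (C, C) query/key/value projections (illustrative, untrained)
    Returns a (C, H, W) map.
    """
    channels, height, width = features.shape
    # 1) Reverse attention (PraNet-style): weight = 1 - sigmoid(prediction).
    #    Confident polyp pixels are suppressed, missed pixels keep nearly all
    #    of their signal, and uncertain ones (often boundaries) keep about half.
    reversed_feats = features * (1.0 - sigmoid(coarse_logits))[None, :, :]
    # 2) Flatten the H*W positions into tokens of dimension C.
    tokens = reversed_feats.reshape(channels, height * width).T  # (N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v           # (N, C) each
    # 3) Single-head scaled dot-product self-attention over all positions.
    attn = softmax(q @ k.T / np.sqrt(channels), axis=-1)         # (N, N)
    # 4) Residual connection, then restore the (C, H, W) layout.
    out_tokens = tokens + attn @ v
    return out_tokens.T.reshape(channels, height, width)


# Toy usage with random inputs.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 5, 5))
logits = rng.standard_normal((5, 5))
w_q, w_k, w_v = (rng.standard_normal((4, 4)) * 0.1 for _ in range(3))
out = reverse_transformer_stage(feats, logits, w_q, w_k, w_v)
print(out.shape)  # (4, 5, 5)
```

Zeroing the value projection reduces the block to plain reverse attention. This makes it easy to check the reverse-weighting step in isolation.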

Paper Structure (16 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: An overview of RTA-Former architecture. The upper section showcases the overall architecture of the RTA-Former model, which is composed of an Encoder, a Hierarchical Feature Synthesizer, and a Decoder. The lower section offers an in-depth view of the internal structure of our Hierarchical Feature Synthesizer.
  • Figure 2: Structure of Reverse Transformer Attention (RTA)
  • Figure 3: Visual comparison of polyp segmentation results from our model and other models on polyps of varying scales. GT refers to the ground truth of the dataset annotations. The last four columns show the prediction masks generated by the models.
  • Figure 4: Visualization of our attention module. Bottleneck 1.0 to Bottleneck 1.2 are the feature maps before the reverse mechanism. Bottleneck 2.0 to Bottleneck 2.2 are the feature maps after the reverse mechanism.