SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation

Ziang Xu, Jens Rittscher, Sharib Ali

TL;DR

Real-time video polyp segmentation is essential for colorectal cancer prevention during colonoscopy, but real-world videos suffer from motion artefacts and variability that degrade performance. The authors present SSTFB, a self-supervised and temporally aware network that uses normalised self-attention and feature branching to learn robust spatiotemporal polyp representations in an end-to-end framework. By combining a PIRL-based pretext task with a patch-level discriminator loss, a global-to-local attention scheme, and a two-stage UNet-like decoder, SSTFB achieves state-of-the-art results across seen, unseen, and out-of-centre video datasets while maintaining real-time speed. The approach demonstrates strong generalisation and practical potential for improving polyp detection and segmentation in clinical video streams.

Abstract

Polyps are early indicators of cancer, so detecting and removing them is critical. They are observed during colonoscopy screening, which produces a stream of video frames. Segmenting polyps in routine screening video is challenging because of co-existing imaging artefacts, motion blur, and floating debris. Most existing polyp segmentation algorithms are developed on curated still-image datasets that do not represent real-world colonoscopy, and their performance often degrades on video data. We propose a video polyp segmentation method that uses self-supervised learning as an auxiliary task together with a spatial-temporal self-attention mechanism for improved representation learning. Our end-to-end configuration and joint optimisation of losses enable the network to learn more discriminative contextual features in videos. Our experimental results demonstrate an improvement over several state-of-the-art (SOTA) methods. Our ablation study also confirms that the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union compared to the recently proposed methods PNS+ and Polyp-PVT, respectively. Results on previously unseen video data indicate that the proposed method generalises.
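
The two metrics named in the abstract, the Dice similarity coefficient and intersection-over-union (IoU), can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists of 0/1; the function name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as polyp in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x3 masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.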

Paper Structure (34 sections, 6 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Proposed SSTFB network: Our network consists of three parts: a) Global and local encoders trained with a self-supervision loss comprising Jigsaw-puzzle sampling. b) High-level global and local features branch from both encoders and pass through a normalised self-attention block (NS-block; Ji et al., 2021), enabling global-to-local feature learning. c) A decoder layer that fuses the self-attention high-level feature maps from NS-blocks and the low-level features from the local encoder for final mask prediction.
  • Figure 2: Normalised self-attention block (NS-block) used in our SSTFB network.
  • Figure 3: Qualitative results: Easy samples from SUN-SEG-Easy (unseen).
  • Figure 4: Qualitative results: Hard samples from SUN-SEG-Hard (unseen, top rows), and samples from unseen data centre CVC-612 (bottom rows).
  • Figure 5: Limitations of our proposed approach (SSTFB), including bubbles, instruments, and imaging artefacts such as specularity and pixel saturation.
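
The Jigsaw-puzzle sampling mentioned in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only: it shows how a frame might be cut into a grid of patches and shuffled, not the self-supervision loss or the authors' implementation. The function name `jigsaw_patches`, the 3x3 default grid, and the nested-list image format are all assumptions.

```python
import random


def jigsaw_patches(image, grid=3, seed=None):
    """Split a 2D image (nested lists) into grid x grid patches and shuffle them.

    Returns (shuffled_patches, order), where order[k] is the original
    row-major index of the patch now at position k.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    patch_h, patch_w = height // grid, width // grid

    # Cut the image into patches in row-major order.
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * patch_h:(gy + 1) * patch_h]
            patches.append([row[gx * patch_w:(gx + 1) * patch_w] for row in rows])

    # Shuffle patch positions with a seeded generator for reproducibility.
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    return [patches[i] for i in order], order


# Hypothetical 6x6 "frame" whose pixel value encodes its position.
frame = [[r * 6 + c for c in range(6)] for r in range(6)]
shuffled, order = jigsaw_patches(frame, grid=3, seed=0)
print(len(shuffled), "patches; permutation:", order)
```

Returning the permutation alongside the patches keeps the original layout recoverable, which is useful if a pretext objective needs to relate the shuffled view back to the source frame.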