SSTFB: Leveraging self-supervised pretext learning and temporal self-attention with feature branching for real-time video polyp segmentation
Ziang Xu, Jens Rittscher, Sharib Ali
TL;DR
Real-time video polyp segmentation is essential for colorectal cancer prevention during colonoscopy, but real-world videos suffer from motion artefacts and variability that degrade performance. The authors present SSTFB, a self-supervised and temporally aware network that uses normalised self-attention and feature branching to learn robust spatiotemporal polyp representations in an end-to-end framework. By combining a PIRL-based pretext task with a patch-level discriminator loss, a global-to-local attention scheme, and a two-stage UNet-like decoder, SSTFB achieves state-of-the-art results across seen, unseen, and out-of-centre video datasets while maintaining real-time speed. The approach demonstrates strong generalisation and practical potential for improving polyp detection and segmentation in clinical video streams.
Abstract
Polyps are early cancer indicators, so assessing occurrences of polyps and their removal is critical. They are observed through a colonoscopy screening procedure that generates a stream of video frames. Segmenting polyps in real colonoscopy video poses several challenges, such as the co-existence of imaging artefacts, motion blur, and floating debris. Most existing polyp segmentation algorithms are developed on curated still image datasets that do not represent real-world colonoscopy, and their performance often degrades on video data. We propose a video polyp segmentation method that uses self-supervised learning as an auxiliary task and a spatial-temporal self-attention mechanism for improved representation learning. Our end-to-end configuration and joint optimisation of losses enable the network to learn more discriminative contextual features in videos. Our experimental results demonstrate an improvement with respect to several state-of-the-art (SOTA) methods. Our ablation study also confirms that the proposed joint end-to-end training improves network accuracy by over 3% and nearly 10% on both the Dice similarity coefficient and intersection-over-union compared to the recently proposed methods PNS+ and Polyp-PVT, respectively. Results on previously unseen video data indicate that the proposed method generalises.
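For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how the Dice similarity coefficient and intersection-over-union are commonly computed for a pair of binary masks. It is not the authors' evaluation code: the function name `dice_and_iou`, the nested-list mask format, and the smoothing term `eps` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # assumed format: rows of 0/1 pixel labels


def dice_and_iou(pred: Mask, target: Mask, eps: float = 1e-7) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks of the same shape.

    `eps` is an illustrative smoothing term that avoids division by zero
    when both masks are empty (both scores then evaluate to 1.0).
    """
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = int(bool(p)), int(bool(t))
            inter += p & t       # pixels marked as polyp in both masks
            pred_sum += p        # polyp pixels in the prediction
            target_sum += t      # polyp pixels in the ground truth
    union = pred_sum + target_sum - inter
    dice = (2 * inter + eps) / (pred_sum + target_sum + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou


# Toy 2x3 example: two overlapping pixels, three positives in each mask.
pred = [[0, 1, 1], [0, 1, 0]]
target = [[0, 1, 0], [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so they rank individual predictions identically, although their averages over a dataset can differ.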
