SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp Segmentation
Qiang Hu, Zhenyu Yi, Ying Zhou, Fang Peng, Mei Liu, Qiang Li, Zhiwei Wang
TL;DR
This paper tackles automatic polyp segmentation in colonoscopy videos, where rapid camera motion and frequent low-quality frames degrade performance. It introduces SALI, a hybrid framework combining a Short-term Alignment Module (SAM) and a Long-term Interaction Module (LIM) to capture both short-range stability and long-range reliability. SAM performs spatial alignment of adjacent frames via deformable convolution and self-attention to produce stable short-term features, while LIM maintains a memory bank and uses a masked-attention mechanism to interact with historical cues for robust long-term representations. On SUN-SEG, SALI achieves state-of-the-art Dice scores across seen and unseen sub-sets, with consistent improvements over prior methods, demonstrating strong robustness and potential clinical impact for CRC diagnosis.
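The long-term memory idea can be illustrated with a minimal sketch. It assumes each historical polyp representation is a flat feature vector and uses ordinary scaled dot-product attention for the readout; it is not the masked-attention mechanism of LIM, whose details are not given here. The names `MemoryBank`, `attention_readout`, and `capacity` are hypothetical and chosen only for illustration.

```python
import math
from collections import deque


def attention_readout(query, memory):
    """Blend stored vectors, weighted by scaled dot-product similarity to the query.

    Plain softmax attention, used as a stand-in (assumption) for the
    paper's masked attention.
    """
    if not memory:
        # Nothing stored yet: fall back to the current feature unchanged.
        return list(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, entry)) / scale for entry in memory]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * entry[i] for w, entry in zip(weights, memory))
        for i in range(len(query))
    ]


class MemoryBank:
    """Fixed-size store of past feature vectors (hypothetical helper)."""

    def __init__(self, capacity=3):
        # A deque with maxlen drops the oldest entry once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def write(self, feature):
        self.entries.append(list(feature))

    def read(self, query):
        return attention_readout(query, list(self.entries))
```

In use, a feature from a reliable frame would be written to the bank, and the feature of a blurry frame would be passed to `read` so that it is replaced by a similarity-weighted blend of the stored history.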
Abstract
Colonoscopy videos provide richer information for polyp segmentation in rectal cancer diagnosis. However, the endoscope's fast motion and close-up observation cause current methods to suffer from large spatial incoherence and runs of consecutive low-quality frames, which limits segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the consistency of adjacent features and rebuilding a reliable polyp representation. To this end, we propose the SALI network, a hybrid of a Short-term Alignment Module (SAM) and a Long-term Interaction Module (LIM). The SAM learns spatially aligned features of adjacent frames via deformable convolution and further harmonizes them to capture a more stable short-term polyp representation. To handle low-quality frames, the LIM stores historical polyp representations in a long-term memory bank and explores retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combining SAM and LIM, the SALI network is highly robust to spatial variations and low-quality visual cues. A benchmark on the large-scale SUN-SEG dataset verifies the superiority of SALI over current state-of-the-art methods, improving Dice by 2.1%, 2.5%, 4.1%, and 1.9% on the four test sub-sets, respectively. Code is available at https://github.com/Scatteredrain/SALI.
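For reference, the Dice coefficient behind the reported gains measures the overlap between a predicted mask and the ground truth: twice the intersection divided by the sum of the two mask sizes. The sketch below computes it for flattened binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not the SUN-SEG evaluation protocol.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2 * intersection / total


# One of two predicted pixels overlaps one of two ground-truth pixels.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # → 0.5
```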
