SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp Segmentation

Qiang Hu; Zhenyu Yi; Ying Zhou; Fang Peng; Mei Liu; Qiang Li; Zhiwei Wang

TL;DR

This paper tackles automatic polyp segmentation in colonoscopy videos, where rapid camera motion and frequent low-quality frames degrade performance. It introduces SALI, a hybrid framework that combines a Short-term Alignment Module (SAM) and a Long-term Interaction Module (LIM) to obtain features that are stable over short ranges and reliable over long ranges. SAM spatially aligns the features of adjacent frames via deformable convolution and self-attention to produce stable short-term features, while LIM maintains a memory bank and uses a masked-attention mechanism to interact with historical cues and build robust long-term representations. On SUN-SEG, SALI achieves state-of-the-art Dice scores on both seen and unseen sub-sets, with consistent improvements over prior methods, indicating strong robustness and potential clinical value for colorectal cancer (CRC) diagnosis.

Abstract

Colonoscopy videos provide richer information for polyp segmentation in rectal cancer diagnosis. However, the endoscope's fast motion and close-up viewing cause large spatial incoherence and runs of consecutive low-quality frames, which limit the segmentation accuracy of current methods. In this context, we focus on robust video polyp segmentation by enhancing the consistency of adjacent features and rebuilding a reliable polyp representation. To this end, we propose the SALI network, a hybrid of a Short-term Alignment Module (SAM) and a Long-term Interaction Module (LIM). The SAM learns spatially aligned features of adjacent frames via deformable convolution and further harmonizes them to capture a more stable short-term polyp representation. To handle low-quality frames, the LIM stores historical polyp representations in a long-term memory bank and explores retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combining SAM and LIM, the SALI network is highly robust to spatial variations and low-visual cues. Benchmarking on the large-scale SUN-SEG dataset verifies the superiority of SALI over current state-of-the-art methods, improving Dice by 2.1%, 2.5%, 4.1% and 1.9% on the four test sub-sets, respectively. Code is available at https://github.com/Scatteredrain/SALI.
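The Dice improvements quoted above refer to the overlap between a predicted mask and its ground-truth mask. As a minimal illustration of that metric only, the sketch below computes Dice for two binary masks given as nested lists. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made here; the benchmark's own evaluation code may threshold and average differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ G| / (|P| + |G|) for two binary masks (nested lists of 0/1)."""
    # Flatten both 2D masks into 1D lists of ints.
    p = [int(v) for row in pred for v in row]
    g = [int(v) for row in target for v in row]
    if len(p) != len(g):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(a & b for a, b in zip(p, g))
    total = sum(p) + sum(g)
    if total == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(dice_score(pred, target))
```

A per-frame score like this would typically be averaged over all frames of a test sub-set before methods are compared.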

Paper Structure (12 sections, 4 equations, 3 figures, 2 tables)

Introduction
Method
Stable Feature Learning via Short-term Alignment Module
Reliable Feature Learning via Long-term Interaction Module
Segmentation and Training details
Experiments
Datasets and Evaluation Metrics
Comparisons with State-of-the-art Methods
Quantitative Comparisons
Qualitative Comparisons
Ablation Study
Conclusion

Figures (3)

Figure 1: Three challenges in video polyp segmentation. (a) The optical flow map (predicted by RAFT; Teed and Deng, 2020) fails to show any object motion information. (b) Significant variations between two adjacent frames. (c) A long sequence of consecutive low-quality frames. Green arrows point to the polyps.
Figure 2: Overview of our proposed SALI. (a) SALI comprises two modules, the short-term alignment module (SAM) and the long-term interaction module (LIM), to obtain stable and reliable spatio-temporal features. (b) SAM first aligns the adjacent features, and then constructs stable short-term features by exploring their relevance. (c) LIM uses a masked-attention (MA) block to let the short-term feature interact with the long-term visual cues in the memory bank, yielding a reliable long-term feature.
Figure 3: Visualization results of different methods on two challenging cases. Upper case: significant variations between adjacent frames. Lower case: a sequence of consecutive low-quality frames.
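
The interaction summarized in Figure 2(c), where a short-term feature queries a memory bank through masked attention, can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the `MemoryBank` class, its FIFO eviction and `capacity` parameter, and the rule that the mask keeps only memory positions flagged as polyp foreground are all hypothetical choices made here. The actual network works on convolutional feature maps with learned projections.

```python
from collections import deque

import numpy as np


class MemoryBank:
    """FIFO store of past (feature, mask) pairs; `capacity` is a hypothetical parameter."""

    def __init__(self, capacity=5):
        self.items = deque(maxlen=capacity)  # oldest entry is evicted automatically

    def write(self, feat, mask):
        # feat: (N, C) flattened features; mask: (N,) foreground flags (assumed format).
        self.items.append((np.asarray(feat, dtype=float), np.asarray(mask, dtype=bool)))

    def read(self):
        feats = np.concatenate([f for f, _ in self.items], axis=0)
        masks = np.concatenate([m for _, m in self.items], axis=0)
        return feats, masks


def masked_attention(query, keys, values, key_mask):
    """Scaled dot-product attention in which masked-out keys receive zero weight."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])  # (Nq, Nk)
    if key_mask.any():
        # Disallowed keys get -inf so their softmax weight is exactly zero.
        scores = np.where(key_mask[None, :], scores, -np.inf)
    # If no key is allowed, fall back to plain attention to avoid NaNs.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = MemoryBank(capacity=2)
    for _ in range(3):  # the third write evicts the first entry
        bank.write(rng.normal(size=(4, 8)), rng.random(4) > 0.5)
    mem_feats, mem_mask = bank.read()
    short_term = rng.normal(size=(3, 8))  # stand-in for a short-term feature
    long_term, attn = masked_attention(short_term, mem_feats, mem_feats, mem_mask)
    print(long_term.shape, attn.shape)
```

The fallback to unmasked attention when every key is masked out is a design choice of this sketch; it keeps the output finite when the stored masks contain no foreground.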
