Self-Supervised Polyp Re-Identification in Colonoscopy

Yotam Intrator, Natalie Aizenberg, Amir Livne, Ehud Rivlin, Roman Goldenberg

TL;DR

The paper tackles the challenge of long-term polyp tracking in colonoscopy to support CADx and automated reporting by introducing a self-supervised, appearance-based ReID framework. It combines an early-fusion, transformer-based multi-view tracklet encoder with a SimCLR-style contrastive objective, and also explores a single-frame representation, both trained without manual labels; positives derive from temporal views of the same polyp and pseudo-positives are created by splitting tracklets. The approach improves tracklet grouping, reducing fragmentation and enhancing CADx accuracy (e.g., AUROC up to $0.77$ for ReID, CADx AUC up to $0.90$ with ReID vs $0.86$ for tracking), approaching the performance attainable with manually annotated GT. These results demonstrate the practical impact of appearance-based ReID on data aggregation, reporting, and clinical metrics in colonoscopy, while acknowledging limitations when polyp appearance changes during procedures and suggesting broader applications in automated reporting and metric computation.
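The training signal described above can be illustrated with a minimal sketch: each tracklet is split into two halves that serve as a pseudo-positive pair, and a SimCLR-style NT-Xent loss pulls the two halves together while pushing other tracklets away. Everything here is an assumption made for illustration, not the authors' implementation: `mean_pool` is a hypothetical stand-in for the transformer tracklet encoder, frames are toy 2-D vectors, and the temperature value is arbitrary.

```python
import math


def split_tracklet(frames):
    """Split one tracklet into two disjoint halves (a pseudo-positive pair)."""
    mid = len(frames) // 2
    return frames[:mid], frames[mid:]


def mean_pool(frames):
    """Hypothetical stand-in for a tracklet encoder: average the frame vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nt_xent(view_a, view_b, temperature=0.1):
    """SimCLR-style NT-Xent loss over N positive pairs (view_a[i], view_b[i])."""
    z = view_a + view_b              # 2N embeddings in one batch
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)        # index of the positive partner of i
        pos = cosine(z[i], z[j]) / temperature
        # Denominator runs over every other embedding, positive included.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(l) for l in logits)) - pos
    return total / (2 * n)


# Toy data: two tracklets, each a list of 2-D "frame embeddings".
tracklets = [
    [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [1.0, 0.05]],
    [[0.0, 1.0], [0.1, 1.0], [0.0, 0.9], [0.05, 1.0]],
]
halves = [split_tracklet(t) for t in tracklets]
view_a = [mean_pool(first) for first, _ in halves]
view_b = [mean_pool(second) for _, second in halves]
loss = nt_xent(view_a, view_b)
```

In this toy setup the loss is low when the two halves of each tracklet are paired correctly and rises sharply if the pairing is shuffled, which is the behavior a contrastive objective is meant to reward.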

Abstract

Computer-aided polyp detection (CADe) is becoming a standard, integral part of any modern colonoscopy system. A typical colonoscopy CADe detects a polyp in a single frame and does not track it through the video sequence. Yet many downstream tasks, including polyp characterization (CADx), quality metrics, and automatic reporting, require aggregating polyp data from multiple frames. In this work we propose a robust long-term polyp tracking method based on re-identification by visual appearance. Our solution uses an attention-based self-supervised ML model, specifically designed to leverage the temporal nature of video input. We quantitatively evaluate the method's performance and demonstrate its value for the CADx task.

Paper Structure (14 sections, 1 equation, 6 figures, 4 tables)

This paper contains 14 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) A polyp image, (b) two additional views of the polyp in (a) taken from the same tracklet, (c) two typical augmentations of the polyp in (a). Images in (b) offer more realistic variations, such as different texture, tools, etc.
  • Figure 2: Multi-view transformer encoder. Tracklet frames are passed through a single-frame encoder to generate frame embeddings. The embeddings, concatenated with the CLS token, then go through the transformer encoder. Finally, the contextualized CLS token from the transformer encoder output goes through a projection head, resulting in the tracklet visual representation.
  • Figure 3: Top: raw tracklets detected by tracking by detection; bottom: combined tracklets after applying ReID. The x-axis is time [s], the y-axis is the unique tracklet ID. Each color change represents a different tracklet. Over 40 different tracklets were found in this procedure, and only 4 remain after applying ReID.
  • Figure 4: Frame-to-frame cosine similarity within a tracklet via the single frame encoder.
  • Figure 5: ROC and PRC plots of various ReID techniques with AUC and AUPRC respectively. The joint embedding method consistently outperforms the other methods.
  • ...and 1 more figures
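
The tracklet merging illustrated in Figure 3 can be sketched as grouping tracklets whose embeddings are similar enough. The source does not state how groups are formed, so the following is only an assumed procedure: tracklets are linked whenever their cosine similarity reaches a hypothetical `threshold`, and linked tracklets are collected with a union-find structure. The embeddings and the threshold value are toy choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_tracklets(embeddings, threshold=0.8):
    """Return one group label per tracklet; similar tracklets share a label."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)   # link the two groups

    return [find(i) for i in range(len(embeddings))]


# Toy tracklet embeddings: two views of one polyp, two views of another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
labels = merge_tracklets(embeddings)
```

Because linking is transitive, a single threshold can chain together tracklets that are not directly similar; a stricter criterion (for example, requiring similarity to every member of a group) would trade fewer false merges for more residual fragmentation.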