Semantic Parsing of Colonoscopy Videos with Multi-Label Temporal Networks

Ori Kelner; Or Weinstein; Ehud Rivlin; Roman Goldenberg

TL;DR

This work tackles semantic parsing of colonoscopy videos to automatically identify phases, landmarks, and tools for improved quality metrics and reporting. It develops a two-stage, multi-label temporal framework that extends MS-TCN to handle non-mutually exclusive labels, augmented by key-frame training and pseudo-labeling to boost the frame encoder. Temporal smoothing and a simple consistency filter enhance pseudo-label quality, while a multi-label MS-TCN refinement enables cross-label information sharing, achieving high per-frame accuracy on a large, multi-center dataset. The approach enables downstream tasks such as automatic report generation, video retrieval, and refined quality assessments, with future work extending to more colon segments and imaging modes.
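
The summary above does not spell out how the pseudo-labels are smoothed or filtered, so the following is only a minimal sketch under stated assumptions: smoothing is approximated by a sliding-window majority vote, and the consistency filter keeps a frame only when its raw prediction agrees with the smoothed one. The function names, the window size, and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def smooth_labels(labels, window=3):
    """Majority vote over a centered sliding window (assumed smoothing rule)."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        # most_common(1) returns the most frequent label in the window
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed


def consistent_frames(raw, smoothed):
    """Indices of frames whose raw prediction matches the smoothed label
    (assumed consistency rule)."""
    return [i for i, (r, s) in enumerate(zip(raw, smoothed)) if r == s]


# Illustrative per-frame predictions from a hypothetical initial model.
raw = ["cecum", "cecum", "rectum", "cecum", "cecum",
       "ascending", "ascending", "ascending"]
smoothed = smooth_labels(raw, window=3)
kept = consistent_frames(raw, smoothed)
```

In this toy sequence the isolated "rectum" prediction at index 2 is outvoted by its neighbors, so that frame would be excluded from the pseudo-labeled training set while the others are kept.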

Abstract

Following the successful debut of polyp detection and characterization, more advanced automation tools are being developed for colonoscopy. The new automation tasks, such as quality metrics or report generation, require an understanding of the procedure flow, including activities, events, and anatomical landmarks. In this work we present a method for automatic semantic parsing of colonoscopy videos. The method uses a novel deep learning (DL) multi-label temporal segmentation model trained in supervised and unsupervised regimes. We evaluate the accuracy of the method on a test set of over 300 annotated colonoscopy videos and use ablation to explore the relative importance of the method's components.

Paper Structure (13 sections, 7 equations, 6 figures, 5 tables)

Introduction
Methods
Baseline Method
Training Single-Frame Encoder with Key Frames
Pseudo Labels
Initial Supervised Model
Pseudo Labeling and Temporal Smoothing
Temporal Consistency Filtering
Multi-Label Temporal Network
Experiments and Results
Dataset
Accuracy Evaluation and Ablation Study
Conclusions and Future Work

Figures (6)

Figure 1: Colon Anatomy (from https://upload.wikimedia.org/wikipedia/commons/2/2d/Blausen_0604_LargeIntestine2.png)
Figure 2: The two-stage video parsing pipeline. The first stage is a single-frame encoder. The second stage runs temporal convolution (MS-TCN, ASFormer) on frame embeddings to yield per-frame classifications.
Figure 3: Left: A non-informative frame with a blocked field of view. Right: A key frame with a clear view of the triradiate fold.
Figure 4: Pre-training of the feature extractor. We use a combination of labeled data and pseudo-labels, as explained in the Pseudo Labels section. After the training is complete, we discard the classification heads and use the feature extractor to embed frames for the temporal network.
Figure 5: Multi-Label MS-TCN with two stages (the number of stages is a hyperparameter). Note that we apply the Softmax activation separately on the logits that correspond to the colon-segments, inside/outside and tools/no-tools.
...and 1 more figure
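
The Figure 5 caption notes that softmax is applied separately to the logits of each label group (colon segments, inside/outside, tools/no-tools). A minimal sketch of that idea for a single frame is given below. The group sizes, the logit ordering, and the function names are assumptions made for illustration; the paper's actual class counts and layout are not given here.

```python
import math

# Hypothetical label groups and sizes; the real class counts may differ.
LABEL_GROUPS = {
    "colon_segment": 3,   # assumed number of segment classes
    "inside_outside": 2,
    "tools": 2,           # tools / no-tools
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def groupwise_softmax(frame_logits, groups=LABEL_GROUPS):
    """Split one frame's logit vector by group and normalize each slice
    independently, so every group yields its own probability distribution."""
    if len(frame_logits) != sum(groups.values()):
        raise ValueError("logit vector length does not match group sizes")
    probs = {}
    start = 0
    for name, size in groups.items():
        probs[name] = softmax(frame_logits[start:start + size])
        start += size
    return probs


# One frame's logits: 3 segment + 2 inside/outside + 2 tools entries.
probs = groupwise_softmax([2.0, 0.5, -1.0, 0.1, 1.2, 3.0, -3.0])
```

Because each group is normalized on its own, a frame can be labeled with a colon segment, an inside/outside state, and a tool state at the same time, which is what makes the output multi-label rather than a single mutually exclusive class.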
