Semantic Parsing of Colonoscopy Videos with Multi-Label Temporal Networks
Ori Kelner, Or Weinstein, Ehud Rivlin, Roman Goldenberg
TL;DR
This work tackles semantic parsing of colonoscopy videos: automatically identifying procedure phases, anatomical landmarks, and tools to support quality metrics and reporting. It develops a two-stage, multi-label temporal framework that extends the multi-stage temporal convolutional network (MS-TCN) to labels that are not mutually exclusive, and it strengthens the frame encoder with key-frame training and pseudo-labeling. Temporal smoothing and a simple consistency filter improve pseudo-label quality, while the multi-label MS-TCN refinement stage lets labels share information with one another. The method achieves high per-frame accuracy on a large, multi-center dataset. It enables downstream tasks such as automatic report generation, video retrieval, and refined quality assessment; future work will extend it to more colon segments and imaging modes.
Abstract
Following the successful debut of polyp detection and characterization, more advanced automation tools are being developed for colonoscopy. The new automation tasks, such as quality metrics and report generation, require an understanding of the procedure flow, including activities, events, and anatomical landmarks. In this work we present a method for automatic semantic parsing of colonoscopy videos. The method uses a novel deep learning multi-label temporal segmentation model trained in supervised and unsupervised regimes. We evaluate the accuracy of the method on a test set of over 300 annotated colonoscopy videos, and use ablation to explore the relative importance of the method's individual components.
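
To make the multi-label setting concrete, the following is a minimal sketch of what per-frame multi-label prediction can look like. It is not the authors' implementation. It assumes per-frame logits of shape (frames, labels), scores each label with an independent sigmoid so that several labels can be active in the same frame, and applies a moving average along the time axis as a stand-in for temporal smoothing. The function names, the threshold, and the window size are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def multilabel_predictions(logits, threshold=0.5):
    """Turn per-frame logits of shape (T, C) into boolean labels.

    Each label is scored independently, so a frame may carry zero,
    one, or several active labels (unlike a softmax over classes).
    The threshold is an illustrative assumption.
    """
    probs = sigmoid(np.asarray(logits, dtype=float))
    return probs >= threshold


def smooth_probabilities(probs, window=5):
    """Moving average over time, applied to each label separately.

    This is a simple stand-in for temporal smoothing, not the
    smoothing used in the paper. Edges are padded by repeating the
    first and last frames so the output keeps shape (T, C).
    """
    probs = np.asarray(probs, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    padded = np.pad(probs, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    columns = [
        np.convolve(padded[:, c], kernel, mode="valid")
        for c in range(probs.shape[1])
    ]
    return np.stack(columns, axis=1)
```

For example, a frame whose logits are positive for two labels comes out with both labels active, and an isolated single-frame spike in one label's probability is flattened by `smooth_probabilities` with `window=3`, so it no longer crosses a 0.5 threshold.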
