Semi-Supervised Pipe Video Temporal Defect Interval Localization
Zhu Huang, Gang Pan, Chao Kang, YaoZhi Lv
TL;DR
This paper tackles temporal defect interval localization in sewer pipe CCTV videos, where time-interval annotations are scarce and where scene types and camera motion differ from those in standard Temporal Action Localization (TAL) tasks. It introduces PipeSPO, a semi-supervised framework that combines an unsupervised contrastive pretext task for learning robust video sequence representations with a semi-supervised, multi-prototype temporal localization stage guided by monocular visual odometry. Key contributions include a clustering-based multi-prototype memory, a prototype-aware decoder, a visual odometry attention module, and 3D-DWT dynamic features that capture temporal textures. On real-world datasets, PipeSPO achieves an average AP of 41.89% across IoU thresholds 0.1–0.7, improving on current state-of-the-art methods by 8.14% and demonstrating practical value for defect interval localization in sewer maintenance.
Abstract
In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation, and quantification. Industry standards typically do not require time-interval annotations, even though these are more informative than time-point annotations for defect localization, so fully supervised methods incur additional annotation costs. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO leverages unlabeled data through unsupervised pretext tasks and uses time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1–0.7, improving by 8.14% over current state-of-the-art methods.
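The reported metric depends on the temporal IoU between a predicted defect interval and an annotated one. The sketch below illustrates that quantity and which thresholds in the 0.1–0.7 range a single prediction would satisfy. It is not the paper's evaluation code: the function names, the (start, end) representation in seconds, and the 0.1 threshold step are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; assumed representation

# Thresholds 0.1, 0.2, ..., 0.7; the step of 0.1 is an assumption.
THRESHOLDS: List[float] = [k / 10 for k in range(1, 8)]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over Union of two 1-D time intervals."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(pred: Interval, gt: Interval) -> List[float]:
    """Thresholds at which the prediction counts as a correct localization."""
    iou = temporal_iou(pred, gt)
    return [t for t in THRESHOLDS if iou >= t]


if __name__ == "__main__":
    predicted = (10.0, 20.0)   # hypothetical predicted defect interval
    annotated = (15.0, 25.0)   # hypothetical ground-truth interval
    # Overlap is 5 s and union is 15 s, so the IoU is 1/3.
    print(temporal_iou(predicted, annotated))
    print(matched_thresholds(predicted, annotated))  # [0.1, 0.2, 0.3]
```

A full average-precision computation would additionally rank predictions by confidence and integrate precision over recall at each threshold before averaging across thresholds; that protocol is not specified here and is left out of the sketch.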
