Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training
Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
TL;DR
AutoQ-VIS tackles unsupervised video instance segmentation with a quality-guided self-training loop that progressively augments the training data with automatically curated pseudo-labels. It relies on three components: a Mask Quality Predictor, DropLoss for stabilizing the mask head, and adaptive fusion to combine new pseudo-labels with existing annotations. Training starts from synthetic data provided by VideoCutLER. The approach achieves state-of-the-art results on YouTubeVIS-2019 (52.6 AP50, +4.4 AP50 over the previous SOTA) and generalizes to UVO-Dense, demonstrating effective synthetic-to-real domain adaptation without human annotations. These findings underscore the viability of quality-aware self-training for unsupervised VIS and its potential for scalable, annotation-free video understanding.
Abstract
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4 $\text{AP}_{50}$ points, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
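The closed loop between pseudo-label generation and quality assessment can be illustrated with a minimal sketch. This is not the released AutoQ-VIS code: the names `PseudoLabel`, `select_confident`, `fuse`, and `self_train`, the fixed quality threshold, and the "keep the higher-scored mask per instance" fusion rule are all hypothetical stand-ins chosen for illustration. The actual Mask Quality Predictor, DropLoss, and adaptive fusion are not specified in this abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PseudoLabel:
    """One predicted instance mask, reduced to its identifiers and a score."""
    video_id: str
    instance_id: int
    quality: float  # hypothetical predicted mask quality in [0, 1]


def select_confident(candidates: List[PseudoLabel], threshold: float) -> List[PseudoLabel]:
    """Keep only candidates whose predicted quality reaches the threshold."""
    return [c for c in candidates if c.quality >= threshold]


def fuse(existing: List[PseudoLabel], new: List[PseudoLabel]) -> List[PseudoLabel]:
    """Merge label sets, keeping the higher-quality entry per instance.

    Assumption: this simple rule stands in for the paper's adaptive fusion.
    """
    best: Dict[Tuple[str, int], PseudoLabel] = {
        (p.video_id, p.instance_id): p for p in existing
    }
    for p in new:
        key = (p.video_id, p.instance_id)
        if key not in best or p.quality > best[key].quality:
            best[key] = p
    return list(best.values())


def self_train(
    predict: Callable[[], List[PseudoLabel]],
    train: Callable[[List[PseudoLabel]], None],
    rounds: int,
    threshold: float,
) -> List[PseudoLabel]:
    """Run the loop: predict, filter by quality, fuse, retrain."""
    labels: List[PseudoLabel] = []
    for _ in range(rounds):
        candidates = predict()                        # pseudo-label generation
        kept = select_confident(candidates, threshold)  # quality assessment
        labels = fuse(labels, kept)                   # grow the training set
        train(labels)                                 # retrain on curated labels
    return labels
```

In this sketch, `predict` and `train` are caller-supplied callables, so the loop itself stays independent of any particular segmentation model. A real pipeline would replace them with inference on unlabeled real videos and a training step on the fused label set.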
