Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos

TL;DR

AutoQ-VIS tackles the challenge of unsupervised video instance segmentation by introducing a quality-guided self-training loop that progressively augments training data with automatically curated pseudo-labels. It hinges on a Mask Quality Predictor, DropLoss for stabilizing the mask head, and adaptive fusion to combine new pseudo-labels with existing annotations, all starting from synthetic data provided by VideoCutLER. The approach achieves state-of-the-art results on YouTubeVIS-2019 (52.6 AP50, +4.4 AP50 over the previous SOTA) and shows generalization to UVO-Dense, demonstrating effective synthetic-to-real domain adaptation without human annotations. These findings underscore the viability of quality-aware self-training for unsupervised VIS and its potential for scalable, annotation-free video understanding.

Abstract

Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4 $\text{AP}_{50}$ points while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
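
The closed loop between pseudo-label generation and quality assessment can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_by_quality`, `self_train`), the callbacks (`train_fn`, `predict_fn`, `score_fn`), the number of rounds, and the 0.75 threshold are all hypothetical placeholders. The merge step is also a deliberate simplification in which a video's newest selection replaces its older one, standing in for the adaptive fusion that the paper describes but that is not reproduced here.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple


def select_by_quality(
    masks: Sequence[Any],
    score_fn: Callable[[Any], float],
    threshold: float,
) -> List[Any]:
    """Keep only the masks whose predicted quality reaches the threshold."""
    return [m for m in masks if score_fn(m) >= threshold]


def self_train(
    synthetic: Sequence[Any],
    videos: Sequence[Hashable],
    train_fn: Callable[[List[Any]], Any],
    predict_fn: Callable[[Any, Hashable], Sequence[Any]],
    score_fn: Callable[[Any], float],
    rounds: int = 2,          # hypothetical default
    threshold: float = 0.75,  # hypothetical default
) -> Tuple[Any, Dict[Hashable, List[Any]]]:
    """Multi-round self-training with quality-based pseudo-label selection."""
    pseudo: Dict[Hashable, List[Any]] = {}

    # Initial stage: train on synthetic data only.
    model = train_fn(list(synthetic))

    for _ in range(rounds):
        # Generate pseudo masks on unlabeled videos and score them with the
        # quality predictor (score_fn is treated as frozen).
        for video in videos:
            kept = select_by_quality(predict_fn(model, video), score_fn, threshold)
            if kept:
                # Simplified merge: the newest selection replaces the old one.
                pseudo[video] = kept

        # Retrain on the synthetic data plus all currently selected pseudo labels.
        merged = list(synthetic) + [m for ms in pseudo.values() for m in ms]
        model = train_fn(merged)

    return model, pseudo
```

With toy callbacks, for example a `predict_fn` that returns `(video, quality)` tuples and a `score_fn` that reads the second element, only the tuples at or above the threshold end up in the returned `pseudo` dictionary.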

Paper Structure

This paper contains 16 sections, 22 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: AutoQ-VIS overview. In the initial training stage, both the VIS model and the mask quality predictor are trained on synthetic videos with pseudo annotations [videocutler]. During the multi-round self-training stage, the VIS model generates pseudo masks on unlabeled videos, which are then scored by the frozen quality predictor. Pseudo masks with high predicted quality are selected and added to the training set. The VIS model is subsequently retrained on both the synthetic data and the selected pseudo labels, enabling iterative refinement and progressive performance gains.
  • Figure 2: Network architecture of VideoMask2Former [mask2former, videomask2former] and the Mask Quality Predictor. Our quality predictor integrates mask predictions and pixel decoder features following [maskscore], employing a sequential architecture with four convolution layers (3$\times$3 kernels, final layer stride of 2 for spatial reduction) followed by three fully-connected layers that ultimately produce mask IoU predictions.
  • Figure 3: The qualitative results of AutoQ-VIS on YouTubeVIS-2019 val split. The quality scores are shown in the center of each object. The visual results demonstrate AutoQ-VIS' proficiency in simultaneous multi-instance segmentation, persistent object tracking, and per-mask quality assessment across video sequences.
  • Figure 4: The qualitative comparison on YouTubeVIS-2019 val split. AutoQ-VIS demonstrates superior instance discovery capabilities compared to VideoCutLER [videocutler]: (1) enhanced multi-object detection capacity, particularly for semantically distinct instances (e.g., person and bull in Column 2); (2) improved segmentation fidelity through precise boundary delineation (e.g., the leopard in Column 3); (3) more comprehensive instance coverage, eliminating false negatives (e.g., detecting humans in Columns 1 & 4 that VideoCutLER completely misses, even without occlusion or scale challenges).
  • Figure 5: Visualized comparison of quality score $Q_l$ and confidence score $s_l$ on YouTubeVIS-2019 val split. Here, $\rho_{s}$ denotes Spearman's rank correlation coefficient. Subplot (a) plots quality scores $Q_l$ against their ground truth IoU. Subplot (b) plots confidence scores $s_l$ against their ground truth IoU.
  • ...and 4 more figures
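
Figure 5 compares the two scores through Spearman's rank correlation coefficient $\rho_{s}$ against ground truth IoU. As a minimal sketch of how such a coefficient is computed (the standard definition, the Pearson correlation of average ranks, and not code from the paper), the helper names `average_ranks` and `spearman_rho` below are hypothetical:

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)
```

Passing predicted scores as `x` and ground truth IoU values as `y` gives a value in [-1, 1]; a score that ranks masks in the same order as their true IoU yields a coefficient of 1.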