UVIS: Unsupervised Video Instance Segmentation

Shuaiyi Huang, Saksham Suri, Kamal Gupta, Sai Saketh Rambhatla, Ser-nam Lim, Abhinav Shrivastava

TL;DR

UVIS tackles unsupervised video instance segmentation by combining self-supervised shape priors from DINO with open-set recognition from CLIP. It uses a three-stage pipeline: frame-level pseudo-label generation with CutLER and CLIP, refined by Prototype Memory Filtering; transformer-based VIS training on the pseudo-labels; and query-based tracking with a tracking memory bank. The result is temporally consistent instance masks obtained without any video annotations or dense label-based pretraining. The main contributions are the semantic prototype memory, which filters pseudo-labels to suppress false positives, and the tracking memory, which maintains temporal consistency across frames. UVIS is evaluated on YouTube-VIS 2019, YouTube-VIS 2021, and Occluded VIS, reaching 21.1 AP on YouTube-VIS 2019. The approach suggests that foundation models can support annotation-free video understanding and reduce annotation costs.

Abstract

Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining. Our key insight comes from leveraging the dense shape prior from the self-supervised vision foundation model DINO and the open-set recognition ability from the image-caption supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. Our UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.
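
The query-based tracking step with a tracking memory bank can be illustrated with a minimal sketch. The abstract does not specify the matching criterion or the memory update rule, so everything below is an assumption made for illustration: cosine similarity between per-frame query embeddings and stored track embeddings, greedy one-to-one matching, and an exponential moving average update. The names `TrackingMemory`, `momentum`, and `match_threshold` are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TrackingMemory:
    """Hypothetical memory bank: one embedding per track, index = track id."""

    def __init__(self, momentum=0.8, match_threshold=0.5):
        self.momentum = momentum                # assumed EMA weight on the old memory
        self.match_threshold = match_threshold  # assumed minimum similarity for a match
        self.bank = []

    def update(self, queries):
        """Return one track id per query of the current frame."""
        # Score every (query, track) pair and visit the best pairs first.
        pairs = sorted(
            ((cosine(q, m), qi, ti)
             for qi, q in enumerate(queries)
             for ti, m in enumerate(self.bank)),
            reverse=True,
        )
        assigned, used_tracks = {}, set()
        for score, qi, ti in pairs:
            if score < self.match_threshold:
                break
            if qi in assigned or ti in used_tracks:
                continue
            assigned[qi] = ti
            used_tracks.add(ti)

        for qi, q in enumerate(queries):
            if qi in assigned:
                # Matched: blend the new query into the stored track embedding.
                ti = assigned[qi]
                self.bank[ti] = [
                    self.momentum * m + (1.0 - self.momentum) * x
                    for m, x in zip(self.bank[ti], q)
                ]
            else:
                # Unmatched: start a new track from this query.
                self.bank.append(list(q))
                assigned[qi] = len(self.bank) - 1
        return [assigned[qi] for qi in range(len(queries))]
```

In this sketch, calling `update` once per frame yields track ids that stay attached to the same instance even when the order of the queries changes between frames. A globally optimal assignment (for example, Hungarian matching) could replace the greedy loop; the greedy version is used here only to keep the example dependency-free.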

Paper Structure (11 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Setting Overview. Previous approaches use COCO dense annotations in addition to VIS dataset full supervision (a), box supervision (b), or no supervision (c). Other previous works use flow information along with frame-level category labels (d). Our approach, UVIS, works in the unsupervised setting: it requires neither dense labels nor per-frame labels and instead relies on foundation models.
  • Figure 2: Overview of our approach, UVIS. Left: the pseudo-label generation pipeline, which generates masks and instance labels using CutLER (Wang et al., 2023) and CLIP (Radford et al., 2021), followed by Prototype Memory Filtering (PMF). Center: model training, which uses an image encoder and a transformer decoder to learn queries that produce per-frame predictions. Right: our proposed tracking memory, which uses per-frame queries and a memory-based update rule to match instances between frames and generate temporally consistent predictions.
  • Figure 3: Visualizations of UVIS on YoutubeVIS-2019 (Yang et al., 2019). Each row shows instance mask and class predictions over time. Our method handles videos containing multiple instances of the same class (rows 1, 3, 4) as well as instances from different classes (row 5). UVIS also shows promising results when instances of the same class overlap (row 4).
  • Figure 4: Visualizations of failure cases on YoutubeVIS-2019 (Yang et al., 2019). Left: CLIP labeling failures, where CLIP assigns the wrong class. Center: prediction inconsistencies, where multiple instances are predicted as one. Right: temporal inconsistencies in the predicted masks.
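
The Prototype Memory Filtering step named in Figure 2, and the CLIP labeling failures shown in Figure 4, suggest the following rough sketch of how a semantic memory could suppress false-positive pseudo-labels. The captions do not describe the actual filtering rule, so this is an assumption made for illustration: each class prototype is the mean embedding of the masks that received that label, and a pseudo-label is dropped when its embedding has low cosine similarity to its own class prototype. The function names and `keep_threshold` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prototypes(embeddings, labels):
    """Assumed prototype definition: mean embedding per predicted class."""
    sums = {}
    counts = defaultdict(int)
    for emb, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
        sums[label] = [s + x for s, x in zip(sums[label], emb)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def filter_pseudo_labels(embeddings, labels, keep_threshold=0.5):
    """Return indices of pseudo-labels that agree with their class prototype."""
    prototypes = build_prototypes(embeddings, labels)
    return [
        i
        for i, (emb, label) in enumerate(zip(embeddings, labels))
        if cosine(emb, prototypes[label]) >= keep_threshold
    ]


# Toy usage: the fourth "cat" embedding points away from the other cats,
# so it is treated as a likely labeling error and removed.
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-0.2, 1.0], [0.0, 1.0]]
labels = ["cat", "cat", "cat", "cat", "dog"]
kept = filter_pseudo_labels(embeddings, labels)
```

In a real pipeline the embeddings would come from an image encoder applied to each masked region; the two-dimensional vectors here are placeholders chosen only to make the behavior easy to follow.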