PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation

Zhangjing Yang; Dun Liu; Xin Wang; Zhe Li; Barathwaj Anandan; Yi Wu


TL;DR

PM-VIS+ tackles video instance segmentation (VIS) without manual video annotations by training on image datasets and supplementing missing categories with ImageNet-bbox. It introduces a three-stage pipeline: generate pseudo-labeled video data with the image-trained PM-VIS+(Image) model, refine those labels with DeAOT and score-based filtering, and train PM-VIS+(Video) on the optimized data. The approach uses dynamic supervision signals that combine pixel-level and bounding-box annotations, enabling VIS performance competitive with fully supervised methods while reducing labeling costs. Experiments on COCO, YTVIS, and OVIS show gains from pseudo-label optimization and from backbone scaling (ResNet-50 and Swin-L), which matters for VIS deployment where annotated video data is scarce.
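The score-based filtering step in the pipeline can be pictured with a minimal sketch. The paper's outline names two filters, TopK (hyperparameter K) and PScore (hyperparameter $\tau$), but this summary does not define them, so everything below is an assumption: the `PseudoInstance` structure, the choice to apply top-k per video, and the plain `score >= tau` threshold are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class PseudoInstance:
    """One pseudo-labeled instance track (hypothetical structure)."""
    video_id: str
    category: str
    score: float  # confidence assigned by the image-trained model


def filter_top_k(instances, k):
    """Keep the k highest-scoring instances within each video (assumed reading of TopK)."""
    by_video = {}
    for inst in instances:
        by_video.setdefault(inst.video_id, []).append(inst)
    kept = []
    for group in by_video.values():
        group.sort(key=lambda inst: inst.score, reverse=True)
        kept.extend(group[:k])
    return kept


def filter_by_threshold(instances, tau):
    """Keep instances whose score is at least tau (assumed reading of PScore)."""
    return [inst for inst in instances if inst.score >= tau]
```

Either filter would run on the pseudo-labeled video data before PM-VIS+(Video) is trained; how the two are combined, if at all, is not stated here.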

Abstract

Video instance segmentation requires detecting, segmenting, and tracking objects in videos, typically relying on costly video annotations. This paper introduces a method that eliminates video annotations by utilizing image datasets. The PM-VIS algorithm is adapted to handle both bounding box and instance-level pixel annotations dynamically. We introduce ImageNet-bbox to supplement missing categories in video datasets and propose the PM-VIS+ algorithm to adjust supervision based on annotation types. To enhance accuracy, we use pseudo masks and semi-supervised optimization techniques on unannotated video data. This method achieves high video instance segmentation performance without manual video annotations, offering a cost-effective solution and new perspectives for video instance segmentation applications. The code will be available at https://github.com/ldknight/PM-VIS-plus
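The idea of adjusting supervision to the annotation type can be sketched as a loss dispatch. This is an illustration under stated assumptions, not the PM-VIS+ loss: the abstract does not give the loss terms, so the sketch swaps in a Dice loss for pixel-level annotations and a box projection term (in the style of box-supervised segmentation) for bounding-box annotations. The function names and the `annotation` dictionary layout are hypothetical.

```python
def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two equal-length sequences of values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def project(mask, axis):
    """Max-project a 2D mask: one value per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [max(row) for row in mask]
    return [max(col) for col in zip(*mask)]


def box_to_mask(box, height, width):
    """Rasterize an (x0, y0, x1, y1) box, end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    return [[1.0 if x0 <= x < x1 and y0 <= y < y1 else 0.0 for x in range(width)]
            for y in range(height)]


def supervision_loss(pred_mask, annotation):
    """Choose the loss term from the annotation type (assumed dispatch rule)."""
    height, width = len(pred_mask), len(pred_mask[0])
    if annotation["type"] == "mask":
        # Pixel-level annotation: compare every pixel.
        flat_pred = [v for row in pred_mask for v in row]
        flat_gt = [v for row in annotation["mask"] for v in row]
        return dice_loss(flat_pred, flat_gt)
    if annotation["type"] == "box":
        # Box annotation: only the row and column extents are constrained.
        box_mask = box_to_mask(annotation["box"], height, width)
        return (dice_loss(project(pred_mask, 0), project(box_mask, 0))
                + dice_loss(project(pred_mask, 1), project(box_mask, 1)))
    raise ValueError(f"unknown annotation type: {annotation['type']}")
```

The point of the dispatch is that a single training loop can mix samples from pixel-annotated sources and box-only sources such as ImageNet-bbox, with each sample contributing whichever term its annotation supports.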


Paper Structure (20 sections, 4 figures, 12 tables)


Introduction
Related work
Video instance segmentation
Methodology
Method flow
Model training process
Video pseudo-label data optimization strategy
Experiment
Datasets
Experimental setup
Ablation experiment
Training different experimental configurations of the PM-VIS+ (Image) model on the image dataset
The impact of the hyperparameter K of the TopK filtering method for pseudo-labeled video data on the recognition rate of PM-VIS+ (Video)
The impact of the hyperparameter $\tau$ of the PScore filtering method on the recognition rate of PM-VIS+ (Video)
The impact of different supervision signals on the pseudo-labeled model PM-VIS+ (Video)
...and 5 more sections

Figures (4)

Figure 1: Method flow diagram.
Figure 2: Model training process.
Figure 3: PM-VIS+(Image) visualization of missed detection data relative to real data.
Figure 4: PM-VIS+(Image) visualization of reasoning results.
