PM-VIS: High-Performance Box-Supervised Video Instance Segmentation
Zhangjing Yang, Dun Liu, Wensheng Cheng, Jinqiao Wang, Yi Wu
TL;DR
This work tackles the high labeling cost of pixel-level video masks by proposing a two-step, box-supervised VIS framework called PM-VIS. It generates three types of pseudo masks from HQ-SAM, IDOL-BoxInst, and DeAOT-tracked Track-masks, then selects and refines them with SCM, DOOB, and SHQM, while filtering ground-truth data via Missing-Data and RIA. PM-VIS integrates pseudo-mask supervision with BoxInst losses to achieve state-of-the-art results on YouTube-VIS 2019/2021 and OVIS, and also demonstrates strong performance when using filtered ground-truth data for fully supervised VIS. The approach narrows the gap between box-supervised and fully supervised VIS, offering a scalable path to high-quality instance segmentation in videos with reduced annotation costs. The method’s practical impact lies in enabling robust, pixel-level VIS predictions from box annotations alone, with potential benefits for video analytics, surveillance, and autonomous systems.
Abstract
Labeling pixel-wise object masks in videos is a resource-intensive and laborious process. Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable way to reduce this annotation burden. In practical applications, the two-step approach is not only more flexible but also achieves higher recognition accuracy. Inspired by the recent success of the Segment Anything Model (SAM), we introduce a novel approach that harnesses instance box annotations from multiple perspectives to generate high-quality instance pseudo masks, thus enriching the information contained in instance annotations. We leverage ground-truth boxes to create three types of pseudo masks using the HQ-SAM model, the box-supervised VIS model (IDOL-BoxInst), and the VOS model (DeAOT), respectively, along with three corresponding optimization mechanisms. Additionally, we introduce two ground-truth data filtering methods, assisted by high-quality pseudo masks, to further enhance training dataset quality and improve the performance of fully supervised VIS methods. To fully capitalize on the obtained high-quality pseudo masks, we introduce a novel algorithm, PM-VIS, which integrates mask losses into IDOL-BoxInst. Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong instance mask prediction ability, achieving state-of-the-art performance on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS validation sets, and notably narrowing the gap between box-supervised and fully supervised VIS methods.
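To make the idea of choosing among several pseudo masks for one ground-truth box concrete, the following is a minimal sketch. It is not the paper's selection mechanism; the abstract does not specify one. It assumes a simple stand-in criterion: keep the candidate whose tight bounding box has the highest IoU with the ground-truth box. The function names (`mask_to_box`, `box_iou`, `select_pseudo_mask`) and the candidate labels are hypothetical and used only for illustration.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight (x1, y1, x2, y2) box of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    # Exclusive upper bounds so that width = x2 - x1 and height = y2 - y1.
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_pseudo_mask(candidates, gt_box):
    """Pick the candidate mask whose tight box best matches the ground-truth box.

    ASSUMPTION: box-consistency scoring is an illustrative criterion only,
    not the selection rule used by PM-VIS.
    """
    best_name, best_score = None, -1.0
    for name, mask in candidates.items():
        box = mask_to_box(mask)
        score = box_iou(box, gt_box) if box is not None else 0.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Usage: three hypothetical candidates for one instance in one frame.
gt_box = (2, 2, 6, 6)
full = np.zeros((8, 8), dtype=bool)
full[2:6, 2:6] = True      # covers the whole box
half = np.zeros((8, 8), dtype=bool)
half[2:4, 2:6] = True      # covers only the top half of the box
empty = np.zeros((8, 8), dtype=bool)

candidates = {"hq_sam": full, "idol_boxinst": half, "track": empty}
best_name, best_score = select_pseudo_mask(candidates, gt_box)
```

A box-consistency score is attractive in this setting because the ground-truth box is the only trusted annotation available; it cannot, however, distinguish two masks that share the same extent but differ inside the box.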
