PM-VIS: High-Performance Box-Supervised Video Instance Segmentation

Zhangjing Yang, Dun Liu, Wensheng Cheng, Jinqiao Wang, Yi Wu

TL;DR

This work tackles the high labeling cost of pixel-level video masks by proposing a two-step, box-supervised VIS framework called PM-VIS. It generates three types of pseudo masks from HQ-SAM, IDOL-BoxInst, and DeAOT-tracked Track-masks, then selects and refines them with SCM, DOOB, and SHQM, while filtering ground-truth data via Missing-Data and RIA. PM-VIS integrates pseudo-mask supervision with BoxInst losses to achieve state-of-the-art results on YouTube-VIS 2019/2021 and OVIS, and also demonstrates strong performance when using filtered ground-truth data for fully supervised VIS. The approach narrows the gap between box-supervised and fully supervised VIS, offering a scalable path to high-quality instance segmentation in videos with reduced annotation costs. The method’s practical impact lies in enabling robust, pixel-level VIS predictions from box annotations alone, with potential benefits for video analytics, surveillance, and autonomous systems.

Abstract

Labeling pixel-wise object masks in videos is a resource-intensive and laborious process. Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process. In practical applications, the two-step approach is not only more flexible but also exhibits higher recognition accuracy. Inspired by the recent success of the Segment Anything Model (SAM), we introduce a novel approach that harnesses instance box annotations from multiple perspectives to generate high-quality instance pseudo masks, thus enriching the information contained in instance annotations. We leverage ground-truth boxes to create three types of pseudo masks using the HQ-SAM model, the box-supervised VIS model (IDOL-BoxInst), and the VOS model (DeAOT) separately, along with three corresponding optimization mechanisms. Additionally, we introduce two ground-truth data filtering methods, assisted by high-quality pseudo masks, to further enhance the training dataset quality and improve the performance of fully supervised VIS methods. To fully capitalize on the obtained high-quality pseudo masks, we introduce a novel algorithm, PM-VIS, to integrate mask losses into IDOL-BoxInst. Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction, achieving state-of-the-art performance on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS validation sets, notably narrowing the gap between box-supervised and fully supervised VIS methods.

Paper Structure

This paper contains 54 sections, 8 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: Comparing pixel-level quality visualization results among the pseudo mask (SAM) from SAM-Masks, the pseudo mask (HQ-SAM) from HQ-SAM-masks, the pseudo mask (OURS) from Track-masks-final, and the mask (GT) from gtmasks.
  • Figure 2: The distribution of IoU between SAM-Masks and gtmasks, as well as between HQ-SAM-masks and gtmasks. The horizontal and vertical axes represent the IoU ranges and the percentage of instances within the range, respectively.
  • Figure 3: The correlation between mask IoU and algorithm mask AP for pseudo masks and gtmasks. The PM-VIS model is trained on HQ-SAM-masks, IDOL-BoxInst-masks, and Track-masks, along with their derived pseudo mask collections, and its performance is evaluated in terms of mask AP on the YTVIS2019 validation set.
  • Figure 4: Illustration of high-performance box-supervised VIS. Given a dataset with box annotations, our method consists of two steps.
  • Figure 5: Pipeline for high-quality pseudo mask generation. Given a video sequence with box annotations, our pipeline consists of two stages. In Stage 1, we employ the HQ-SAM model and the IDOL-BoxInst model to generate pixel-level predictions for the target objects, resulting in HQ-SAM-masks and IDOL-BoxInst-masks, respectively. Stage 2 involves selecting appropriate keyframes from the IDOL-BoxInst-masks to initialize the DeAOT model, resulting in higher-quality pixel-level predictions, represented as Track-masks. Track-masks-final is derived by selecting higher-quality pseudo masks from HQ-SAM-masks, IDOL-BoxInst-masks, and Track-masks using the SHQM method. Notably, dashed lines indicate auxiliary usage, where HQ-SAM-masks is not directly utilized in subsequent applications. Conversely, solid lines represent direct usage. Two auxiliary mechanisms, SCM and SHQM, are employed to assist in selecting high-quality candidates for pseudo masks.
  • ...and 5 more figures
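
The captions for Figures 2 and 3 rely on the IoU between a pseudo mask and its ground-truth mask, and on the percentage of instances that fall within each IoU range. The following is a minimal sketch of that measurement, not the authors' code. The function names `mask_iou` and `iou_histogram`, the representation of masks as nested lists of 0/1 values, the choice of ten equal-width bins, and the handling of two empty masks are all assumptions made for illustration.

```python
from typing import List, Sequence

# A binary mask represented as rows of 0/1 values (an assumption of this sketch;
# real pipelines would typically use array types instead).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match (assumed convention).
    return inter / union if union else 1.0


def iou_histogram(ious: Sequence[float], num_bins: int = 10) -> List[float]:
    """Percentage of instances whose IoU falls in each equal-width bin of [0, 1]."""
    if not ious:
        return [0.0] * num_bins
    counts = [0] * num_bins
    for iou in ious:
        # An IoU of exactly 1.0 is placed in the last bin.
        idx = min(int(iou * num_bins), num_bins - 1)
        counts[idx] += 1
    return [100.0 * c / len(ious) for c in counts]


# Toy example: a 4-pixel pseudo mask fully contained in a 5-pixel ground-truth mask.
pseudo = [[1, 1, 0],
          [1, 1, 0],
          [0, 0, 0]]
gt = [[1, 1, 0],
      [1, 1, 1],
      [0, 0, 0]]
example_iou = mask_iou(pseudo, gt)  # intersection 4, union 5

# Hypothetical per-instance IoU values, used only to exercise the histogram.
example_hist = iou_histogram([example_iou, 0.95, 0.25, 1.0])
```

Computing `iou_histogram` once per pseudo-mask source (for example, one list of IoU values per source) yields the kind of per-range percentages that Figure 2 plots. This comparison requires ground-truth masks, so it serves as an analysis tool rather than as part of box-supervised training.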