Point-VOS: Pointing Up Video Object Segmentation

Idil Esen Zulfikar, Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe

TL;DR

Point-VOS tackles the high annotation cost of dense VOS masks by introducing spatio-temporally sparse point annotations for both training and testing. Applying the scheme to two large-scale video datasets with text descriptions yields PV-Oops and PV-Kinetics, which together contain over 19M points across 133K objects in 32K videos. The paper establishes a Point-VOS benchmark with point-based baselines and shows that existing VOS methods trained on pseudo-masks generated from the points come close to fully-supervised performance. It also shows that the data, which builds on Video Localized Narratives, improves models on the Video Narrative Grounding (VNG) task. Together, these results suggest that point annotations offer a practical way to scale both VOS and vision-language data collection.

Abstract

Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-scale video datasets with text descriptions and annotate over 19M points across 133K objects in 32K videos. Based on our annotations, we propose a new Point-VOS benchmark, and a corresponding point-based training mechanism, which we use to establish strong baseline results. We show that existing VOS methods can easily be adapted to leverage our point annotations during training, and can achieve results close to the fully-supervised performance when trained on pseudo-masks generated from these points. In addition, we show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task. We will make our code and annotations available at https://pointvos.github.io.

Paper Structure (16 sections, 20 figures, 11 tables)

This paper contains 16 sections, 20 figures, 11 tables.

Figures (20)

  • Figure 1: Comparison of the conventional VOS task with our new Point-VOS task. (a) The conventional VOS task uses a dense segmentation mask for each frame during training and initializes the first-frame reference with dense masks. (b) We propose to change this paradigm and use only spatially sparse point annotations on a sparse subset of frames during training, and only a few points for the first-frame reference initialization. Green and blue dots represent foreground points and red dots background points.
  • Figure 2: Training vs. test-time point supervision results using simulated points on the DAVIS validation set. $\blacklozenge$ represents our chosen setting, i.e., 10 points for training supervision and 10 points for test-time supervision. We run each experiment 5 times and report the mean score.
  • Figure 3: STCN results on the DAVIS validation set for varying temporal sparsity, when trained on 10 randomly sampled points per frame per object. $\bigstar$ represents our chosen setting, i.e., 10 points for training supervision and 10 points for test-time supervision, on 10 frames. We run each experiment 3 times and report the mean score.
  • Figure 4: Semi-automatic annotation pipeline used to annotate VidLN data. We first extract a mouse trace segment for each noun in VidLN captions, and convert it into a pseudo mask using DynaMITe. We then use STCN to propagate the pseudo-mask across the video. We then use the STCN output probability maps to sample sparse point annotations and let annotators verify them. Green circles represent foreground points and red circles background points.
  • Figure 5: Example point annotations for PV-Oops (top) and PV-Kinetics (bottom). The objects are connected to nouns from a large vocabulary. Green dots represent foreground points and red dots background points.
  • ...and 15 more figures
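
The simulated point supervision mentioned in the captions of Figures 2 and 3 (a few randomly sampled points per frame, on a sparse subset of frames) can be illustrated with a small sketch. This is not the authors' code: the function name `sample_sparse_points`, the list-of-lists mask representation, and the choice to sample uniformly over all pixels and read each label from the dense mask are assumptions made for illustration only.

```python
import random


def sample_sparse_points(masks, num_frames=10, points_per_frame=10, seed=0):
    """Simulate spatio-temporally sparse point annotations from dense masks.

    Hypothetical helper, not taken from the paper.
    masks: list of frames; each frame is a 2D list of 0/1 values
           (1 = object pixel, 0 = background pixel).
    Returns a dict mapping frame index -> list of (row, col, label),
    where label is 1 for a foreground point and 0 for a background point.
    """
    rng = random.Random(seed)
    # Temporal sparsity: annotate only a random subset of frames.
    frame_ids = sorted(rng.sample(range(len(masks)), min(num_frames, len(masks))))
    annotations = {}
    for t in frame_ids:
        mask = masks[t]
        pixels = [(r, c) for r in range(len(mask)) for c in range(len(mask[0]))]
        # Spatial sparsity: keep only a handful of pixels in this frame
        # (assumption: uniform sampling over all pixels).
        chosen = rng.sample(pixels, min(points_per_frame, len(pixels)))
        annotations[t] = [(r, c, mask[r][c]) for r, c in chosen]
    return annotations


# Toy video: 20 frames of 8x8 pixels with a static 4x4 object in the middle.
video = [
    [[1 if 2 <= r < 6 and 2 <= c < 6 else 0 for c in range(8)] for r in range(8)]
    for _ in range(20)
]
points = sample_sparse_points(video)
```

With the default arguments this mirrors the setting marked in Figure 3 (10 points on 10 frames); a real pipeline would instead read verified annotator points, as described for Figure 4.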