One-shot Training for Video Object Segmentation

Baiyu Chen, Sixian Chan, Xiaoqin Zhang

TL;DR

This work tackles the high annotation cost of video object segmentation by introducing a general one-shot training framework that requires only a single labeled frame per training video. It trains VOS networks end-to-end through bi-directional processing: a time-forward inference to generate masks and a time-backward reconstruction to recover the initial mask, using the reconstructed output to update the model. The approach is model-agnostic and validated across three state-of-the-art networks (STCN, XMem, Cutie), achieving performance close to fully supervised methods on DAVIS and YouTube-VOS with as little as 1.4%–3.7% labeled data. This label-efficient strategy significantly reduces annotation effort while maintaining competitive accuracy, and its simple, end-to-end design facilitates broad applicability and practical deployment.

Abstract

Video Object Segmentation (VOS) aims to track objects across the frames of a video and segment them, given an annotation of the target objects in the initial frame. Previous VOS works typically rely on fully annotated videos for training, but acquiring such annotations is labor-intensive and time-consuming. Self-supervised VOS methods have instead attempted to build VOS systems through correspondence learning and label propagation. However, the absence of mask priors harms their robustness in complex scenarios, and the label propagation paradigm makes them inefficient in practice. To address these issues, we propose, for the first time, a general one-shot training framework for VOS that requires only a single labeled frame per training video and is applicable to a majority of state-of-the-art VOS networks. Specifically, our algorithm consists of two steps: i) inferring object masks time-forward from the initial labeled frame, and ii) reconstructing the initial object mask time-backward using the masks from step i). Through this bi-directional training, a satisfactory VOS network can be obtained. Notably, our approach is extremely simple and can be employed end-to-end. Finally, using a single labeled frame per training video of the YouTube-VOS and DAVIS datasets, our approach achieves results comparable to those of models trained on the fully labeled datasets. The code will be released.
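
The two steps in the abstract can be illustrated with a minimal data-flow sketch. It is not the authors' implementation. The `propagate` interface, the flattened list representation of frames and masks, and the binary cross-entropy loss are all assumptions made for illustration. Real networks such as STCN, XMem, and Cutie read from a memory of several frames rather than only the previous one. The sketch propagates the single labeled mask forward, propagates the last prediction back to the first frame, and compares only the reconstructed mask with the one available label.

```python
import math
from typing import Callable, List, Sequence

Mask = List[float]   # flattened per-pixel foreground probabilities (assumed format)
Frame = List[float]  # flattened toy "image" (assumed format)

# Hypothetical interface: (reference frame, reference mask, query frame) -> query mask.
Propagate = Callable[[Frame, Mask, Frame], Mask]


def bce(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; the choice of loss is an assumption."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def one_shot_loss(frames: Sequence[Frame], first_mask: Mask,
                  propagate: Propagate) -> float:
    """Forward inference, backward reconstruction, loss on the first frame only."""
    mask = first_mask
    # i) Time-forward: infer masks from the initial labeled frame.
    #    Intermediate predictions are never compared with any label.
    for t in range(1, len(frames)):
        mask = propagate(frames[t - 1], mask, frames[t])
    # ii) Time-backward: walk the frames in reverse to reconstruct the first mask.
    for t in range(len(frames) - 1, 0, -1):
        mask = propagate(frames[t], mask, frames[t - 1])
    # The only supervision signal: reconstructed mask vs. the single label.
    return bce(mask, first_mask)


def copy_mask(ref_frame: Frame, ref_mask: Mask, query_frame: Frame) -> Mask:
    """Toy stand-in for a perfect network on a static scene."""
    return list(ref_mask)


def blur_mask(ref_frame: Frame, ref_mask: Mask, query_frame: Frame) -> Mask:
    """Toy stand-in for an imperfect network: each step loses confidence."""
    return [0.5 + 0.9 * (m - 0.5) for m in ref_mask]


frames = [[0.0] * 4 for _ in range(3)]
first_mask = [1.0, 0.0, 1.0, 0.0]
loss_copy = one_shot_loss(frames, first_mask, copy_mask)
loss_blur = one_shot_loss(frames, first_mask, blur_mask)
```

In actual training, `propagate` would be a differentiable network and an autograd framework would backpropagate this loss through both passes. The toy propagators above only show that a lossless round trip yields a near-zero loss, while a degrading one is penalized.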

Paper Structure (26 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Illustration of the main concept. (a) We found that existing video object segmentation networks trained from a noisy reference mask (empty or all-black) can still predict a rough mask throughout a video sequence. (b) To exploit this property, we treat the rough prediction as the noisy reference to build a feedback loop, and propose a straightforward One-shot Training framework for video object segmentation that exhibits strong label efficiency and generalization.
  • Figure 2: Left: The traditional fully supervised training paradigm for VOS networks. At each time step, object masks predicted by a VOS network $\Theta$ are aligned with the corresponding labels to update $\Theta$. Right: Our proposed One-shot Training framework for VOS networks. We first use a VOS network $\Theta$ to infer masks time-forward from the initial labeled frame, without matching the intermediate predictions against labels. Then, we reconstruct the initial object mask from the last predicted mask.
  • Figure 3: (a) T-step backward: simply reverse the frames in time and pass them through the VOS network $\Theta$, as in time-forward inference. (b) 1-step backward: treat the initial frame as adjacent to the current frame and pass through $\Theta$ once to reconstruct the initial mask. (c) 2-step backward: randomly sample another frame as the only intermediate frame between the current frame and the initial frame, and pass through $\Theta$ twice to predict the initial mask.
  • Figure 4: Qualitative comparisons of STCN, XMem, and Cutie trained with one-shot training and fully supervised training on DAVIS 2017 val. The first and third comparisons show that our approach is competitive with the fully supervised models. The second comparison, on XMem, shows that the fully supervised model is more sensitive to details.
  • Figure 5: Qualitative comparisons of STCN, XMem, and Cutie trained with one-shot training and fully supervised training on YouTube-VOS 2018 val. Although subtle errors can be observed in the fourth row, our predictions look satisfactory.
  • ...and 2 more figures
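
The three backward variants described in the Figure 3 caption differ only in which frames are visited on the way back to the initial frame. A minimal sketch of those visiting orders follows. The function name `backward_schedule` and the mode strings are hypothetical. Drawing the intermediate frame of the 2-step variant uniformly from the frames strictly between the current and initial frame is an assumption, since the caption does not specify the sampling rule.

```python
import random
from typing import List, Optional


def backward_schedule(num_frames: int, mode: str,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Frame indices visited when reconstructing the mask of frame 0.

    The walk starts at the last forward prediction (index num_frames - 1).
    Each consecutive pair of indices corresponds to one pass through the network.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    last = num_frames - 1
    if mode == "t_step":
        # Reverse the clip and revisit every frame.
        return list(range(last, -1, -1))
    if mode == "1_step":
        # Treat frame 0 as if it were adjacent to the current frame.
        return [last, 0]
    if mode == "2_step":
        # One randomly sampled intermediate frame (sampling rule is assumed).
        candidates = range(1, last)
        if len(candidates) == 0:
            raise ValueError("2-step backward needs at least three frames")
        rng = rng or random.Random()
        return [last, rng.choice(candidates), 0]
    raise ValueError(f"unknown mode: {mode}")


def num_passes(schedule: List[int]) -> int:
    """Number of network passes implied by a schedule."""
    return len(schedule) - 1


t_step = backward_schedule(5, "t_step")
one_step = backward_schedule(5, "1_step")
two_step = backward_schedule(5, "2_step", random.Random(0))
```

Under these assumptions, the 1-step and 2-step variants need one and two passes regardless of clip length, whereas the T-step variant needs one pass per frame transition.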