One-shot Training for Video Object Segmentation
Baiyu Chen, Sixian Chan, Xiaoqin Zhang
TL;DR
This work tackles the high annotation cost of video object segmentation by introducing a general one-shot training framework that requires only a single labeled frame per training video. It trains VOS networks end-to-end through bi-directional processing: a time-forward inference pass generates masks, a time-backward pass reconstructs the initial mask from them, and the reconstructed output is used to update the model. The approach is model-agnostic and is validated on three state-of-the-art networks (STCN, XMem, Cutie), achieving performance close to fully supervised methods on DAVIS and YouTube-VOS while using only 1.4%–3.7% of the labeled data. The strategy reduces annotation effort while maintaining competitive accuracy, and its simple, end-to-end design makes it broadly applicable.
Abstract
Video Object Segmentation (VOS) aims to track and segment target objects across the frames of a video, given their annotation in the initial frame. Previous VOS works typically rely on fully annotated videos for training. However, acquiring fully annotated training videos for VOS is labor-intensive and time-consuming. Meanwhile, self-supervised VOS methods have attempted to build VOS systems through correspondence learning and label propagation. Still, the absence of mask priors harms their robustness in complex scenarios, and the label propagation paradigm makes them impractical in terms of efficiency. To address these issues, we propose, for the first time, a general one-shot training framework for VOS that requires only a single labeled frame per training video and is applicable to most state-of-the-art VOS networks. Specifically, our algorithm consists of two steps: i) inferring object masks time-forward from the initial labeled frame, and ii) reconstructing the initial object mask time-backward using the masks from step i). This bi-directional training yields a satisfactory VOS network. Notably, our approach is extremely simple and can be trained end-to-end. Using only a single labeled frame per training video of the YouTube-VOS and DAVIS datasets, our approach achieves results comparable to those of networks trained on the fully labeled datasets. The code will be released.
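The two steps of the bi-directional scheme can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The `propagate` interface (memory frames, memory masks, query frame), the choice of binary cross-entropy as the reconstruction loss, and the toy propagators `copy_last` and `uninformed` are all assumptions made for illustration; the abstract states only that masks are inferred time-forward and that the initial mask is reconstructed time-backward.

```python
import math
from typing import Callable, List, Sequence

Frame = List[List[float]]  # toy grayscale frame, H x W
Mask = List[List[float]]   # soft mask, H x W, values in [0, 1]
# Hypothetical interface standing in for a VOS network:
# (memory frames, memory masks, query frame) -> predicted query mask
Propagate = Callable[[Sequence[Frame], Sequence[Mask], Frame], Mask]


def bce(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a soft mask and a binary label."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def one_shot_loss(propagate: Propagate, frames: List[Frame], first_mask: Mask) -> float:
    """Loss for one training video in which only frames[0] is labeled."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")

    # Step i: time-forward inference, starting from the single labeled frame.
    masks = [first_mask]
    for t in range(1, len(frames)):
        masks.append(propagate(frames[:t], masks, frames[t]))

    # Step ii: time-backward reconstruction of the initial mask, using only
    # the predicted masks from step i as memory (assumed ordering: latest first).
    back_frames = list(reversed(frames[1:]))
    back_masks = list(reversed(masks[1:]))
    reconstructed = propagate(back_frames, back_masks, frames[0])

    # The only supervision: compare the reconstruction with the one label.
    return bce(reconstructed, first_mask)


def copy_last(memory_frames, memory_masks, query_frame) -> Mask:
    """Toy propagator: copy the most recent memory mask (ideal for a static scene)."""
    return [row[:] for row in memory_masks[-1]]


def uninformed(memory_frames, memory_masks, query_frame) -> Mask:
    """Toy propagator: predict 0.5 everywhere, ignoring the memory."""
    return [[0.5 for _ in row] for row in query_frame]
```

On a static toy video, `copy_last` reconstructs the label almost exactly and the loss is near zero, while `uninformed` gives a loss of log 2. In an actual training setup the propagator would be a differentiable network such as STCN, XMem, or Cutie, and this scalar would presumably be backpropagated to update its weights.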
