LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation

Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, Wei Zhang, Wenqiang Zhang

TL;DR

LVOS introduces the first large-scale, densely annotated benchmark for long-term video object segmentation, addressing the mismatch between real-world long-duration videos and existing short-term datasets. It provides 720 long videos (avg 1.14 minutes), extensive annotations, and a semi-automatic labeling pipeline to enable scalable ground truth. Through comprehensive experiments across semi-supervised, unsupervised, and interactive settings, LVOS reveals that video length and error accumulation are key bottlenecks, and demonstrates that training on LVOS can substantially improve long-term VOS performance. The dataset, along with detailed attribute analyses and oracle experiments, offers actionable insights for designing robust long-term VOS methods and outlines future directions in long-term memory, appearance modeling, and annotation-efficient learning.

Abstract

Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, where objects remain visible most of the time. However, these benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearance and cross-temporal similar objects. Compared to previous benchmarks, LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope LVOS can advance the development of VOS in real scenes. Data and code are available at https://lingyihongfd.github.io/lvos.github.io/.
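
The dataset figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes that the 296,401 frames are counted at the 6 FPS annotation rate mentioned in the Figure 2 caption, which the abstract itself does not state. The function name `dataset_stats` and the constant `ASSUMED_FPS` are hypothetical and not part of any LVOS tooling.

```python
# Figures quoted in the abstract.
NUM_VIDEOS = 720
NUM_FRAMES = 296_401
NUM_ANNOTATIONS = 407_945

# Assumption (not stated in the abstract): frames are counted at the
# 6 FPS annotation rate described in the Figure 2 caption.
ASSUMED_FPS = 6


def dataset_stats(num_videos, num_frames, num_annotations, fps):
    """Derive per-video and per-frame averages from the headline counts."""
    frames_per_video = num_frames / num_videos
    seconds_per_video = frames_per_video / fps
    return {
        "frames_per_video": frames_per_video,
        "minutes_per_video": seconds_per_video / 60,
        "annotations_per_frame": num_annotations / num_frames,
    }


stats = dataset_stats(NUM_VIDEOS, NUM_FRAMES, NUM_ANNOTATIONS, ASSUMED_FPS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under that frame-rate assumption, the implied average duration rounds to the 1.14 minutes reported in the abstract, and the annotation count works out to more than one annotated object per frame on average.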

Paper Structure (17 sections, 7 figures, 11 tables)

This paper contains 17 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Example sequences of our Large-scale Long-term Video Object Segmentation (LVOS) benchmark. Compared to previous video object segmentation datasets, LVOS presents greater challenges, with the main difficulties stemming from longer video durations, intricate scenes, frequent disappearance and reappearance of objects, cross-temporal confusion, and small objects. Text in green denotes that the target object is occluded, while text in orange denotes that the target object is out of view.
  • Figure 2: Annotation pipeline, consisting of four steps. Step 1: 1 FPS Automatic Segmentation. Object tracking models [wei2023autoregressive] and SAM [kirillov2023segment] are adopted to automatically segment the target object at 1 FPS. Step 2: 1 FPS Manual Correction. We manually refine and correct the masks obtained in Step 1. Step 3: Mask Propagation from 1 FPS to 6 FPS. We propagate masks from 1 FPS to 6 FPS using a VOS model [yang2022decoupling]. Step 4: 6 FPS Manual Correction. We manually correct the masks obtained in Step 3.
  • Figure 3: The histogram of instance masks for five parent classes and sub-classes. Objects are sorted by frequency. The entire category set roughly covers diverse objects and motions that occur in everyday scenarios.
  • Figure 4: Attribute distribution in LVOS. In sub-figure (b), a link indicates a high likelihood that the connected attributes appear together in a sequence. Best viewed in color.
  • Figure 5: Cumulative frequency graph of target box areas (expressed as percentages of the total image area) for different datasets. (a) displays the cumulative frequency graph based on annotations from all frames. (b) shows the cumulative frequency graph of the first frame annotations.
  • ...and 2 more figures
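
The cumulative frequency graph described in the Figure 5 caption can be illustrated with a minimal sketch. It expresses each target box area as a percentage of the image area and reports the fraction of boxes at or below each threshold. The box sizes, image resolution, thresholds, and function names below are made up for illustration; they are not LVOS data or the authors' code.

```python
def box_area_percent(box_w, box_h, img_w, img_h):
    """Box area as a percentage of the total image area."""
    return 100.0 * (box_w * box_h) / (img_w * img_h)


def cumulative_frequency(area_percents, thresholds):
    """Fraction of boxes whose area percentage is <= each threshold."""
    if not area_percents:
        raise ValueError("need at least one box")
    n = len(area_percents)
    return [sum(1 for a in area_percents if a <= t) / n for t in thresholds]


# Hypothetical (width, height) boxes in a 1280x720 frame; not real data.
IMG_W, IMG_H = 1280, 720
boxes = [(32, 32), (64, 48), (128, 96), (320, 240), (640, 360)]

areas = [box_area_percent(w, h, IMG_W, IMG_H) for w, h in boxes]
thresholds = [0.5, 1, 5, 10, 50]  # percent of image area
curve = cumulative_frequency(areas, thresholds)

for t, f in zip(thresholds, curve):
    print(f"area <= {t}% of image: {f:.0%} of boxes")
```

A curve that rises quickly at small thresholds indicates a dataset dominated by small targets, which is the kind of comparison such a graph supports across datasets.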