LSVOS Challenge Report: Large-scale Complex and Long Video Object Segmentation

Henghui Ding, Lingyi Hong, Chang Liu, Ning Xu, Linjie Yang, Yuchen Fan, Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Jinming Chai, Qin Ma, Junpei Zhang, Licheng Jiao, Fang Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, LingLing Li, Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong, Tuyen Tran, Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu

TL;DR

The paper presents the 6th LSVOS Challenge, addressing the gap between benchmark performance and real-world video complexity by introducing VOS and RVOS tasks evaluated on the MOSE, LVOS, and MeViS datasets. It highlights diverse, memory-augmented and promptable segmentation approaches, including UNINEXT-, MUTR-, Grounding DINO-, and HQ-SAM–based pipelines, as well as SAM2-inspired memory mechanisms. The report analyzes top-performing methods across tracks, demonstrates substantial participation (129 teams), and discusses progress and remaining challenges in long-term temporal coherence, multi-object scenarios, and motion-rich references. Overall, the challenge emphasizes memory, cross-modal cues, and promptable segmentation as key drivers toward scalable, real-world video object segmentation.

Abstract

Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large-scale Video Object Segmentation (LSVOS) challenge, held in conjunction with the ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube-VOS and YouTube-RVOS benchmarks with the latest datasets, MOSE, LVOS, and MeViS, to assess VOS in more challenging and complex environments. The challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes an introduction to the challenge and datasets, along with the methods used by the top 7 teams in the two tracks. More details can be found on our homepage: https://lsvos.github.io/.

Paper Structure (15 sections, 9 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overall framework of the PCL VisionLab team's method, the 1st place solution for the 6th LSVOS Challenge at ECCV 2024.
  • Figure 2: Workflow of CSS-Segment. The image encoder runs in a streaming fashion, consuming video frames as they become available. The mask encoder applies convolutions, and its output is summed element-wise with the image embedding. We store pixel memory and object memory representations from past segmented (memory) frames. Pixel memory is retrieved for the query frame as a pixel readout, which interacts bidirectionally with object queries and object memory in the object transformer. The object transformer blocks enrich the pixel features with object-level semantics and produce the final object readout, which is decoded into the output mask.
  • Figure 3: An overview of the Dual-Model VOS Enhancement framework. The figure illustrates the key components of our approach, including the memory-based paradigm, pixel-level matching, and the object query mechanism.
  • Figure 4: The SAM 2 architecture [ravi2024sam2].
  • Figure 5: The overview architecture of the proposed method from MVP-TIME: The 1st Solution for LSVOS Challenge RVOS Track.
  • ...and 2 more figures
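
The pixel memory readout mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention between query-frame keys and stored memory keys; the caption does not specify the similarity function, so this is one plausible reading and not the CSS-Segment implementation. The names `pixel_readout`, `memory_keys`, and `memory_values` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_readout(query_keys, memory_keys, memory_values):
    """Retrieve one memory value per query pixel (assumed dot-product attention).

    query_keys:    Q key vectors for the pixels of the query frame
    memory_keys:   M key vectors for pixels stored from past frames
    memory_values: M value vectors paired with memory_keys
    Returns Q readout vectors, each a weighted sum of memory values.
    """
    scale = 1.0 / math.sqrt(len(memory_keys[0]))
    value_dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Similarity of this query pixel to every stored memory pixel.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        # Weighted sum of the memory values under those attention weights.
        readout.append([
            sum(w * v[d] for w, v in zip(weights, memory_values))
            for d in range(value_dim)
        ])
    return readout


# Toy example: two memory pixels with one-hot values
# (think "object" vs. "background"), and one query pixel
# whose key closely matches the first memory pixel.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[1.0, 0.0], [0.0, 1.0]]
query_keys = [[10.0, 0.0]]
readout = pixel_readout(query_keys, memory_keys, memory_values)
```

In this toy case the query pixel matches the first memory key, so its readout is dominated by the first memory value. In a full model the readout would then be refined by the object transformer, which this sketch does not cover.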