LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation

Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu, Linjie Yang, Yuchen Fan, Mingqi Gao, Jingkun Chen, Yunqi Miao, Gengshen Wu, Zhijin Qin, Jungong Han, Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Chang Soo Lim, Joonyoung Moon, Donghyeon Cho, Tingmin Li, Yixuan Li, Yang Yang, An Yan, Leilei Cao, Feng Lu, Ran Hong, Youhai Jiang, Fengjie Zhu, Yujie Xie, Hongyang Zhang, Zhihui Liu, Shihai Ruan, Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji, Ran Hong, Feng Lu, Leilei Cao, An Yan, Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe

TL;DR

The paper surveys the LSVOS 2025 Challenge, introducing MOSEv2 as a significantly more challenging complex VOS benchmark and detailing the three tracks—MOSEv2, VOS, and RVOS—along with their evaluation metrics, datasets, and public benchmarking setup. It highlights top-performing solutions across tracks, including memory-guided approaches (Long-term Grounding Memory and Concept-aware Memory), LVLM-enhanced segmentation pipelines, and confidence-guided ensemble methods. A cross-track analysis reveals the growing importance of LLM/MLLM components and memory-aware propagation for robust, long-term segmentation under occlusion, clutter, and appearance changes, as well as the need for language-grounded, interactive capabilities. The findings indicate substantial headroom in MOSEv2 and emphasize practical implications for resilient, language-aware video understanding in the wild, with clear avenues for future research in memory mechanisms, semantic grounding, and cross-modal integration. Overall, MOSEv2 pushes models toward robust, long-horizon, language-informed VOS, while traditional VOS/RVOS tracks reinforce the value of model ensembles and memory-aware strategies in real-world scenes.

Abstract

This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge, held in conjunction with ICCV 2025. Alongside the two traditional LSVOS tracks, Classic VOS (VOS) and Referring VOS (RVOS), which jointly target robustness in realistic video scenarios, the 2025 edition introduces a new track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty by introducing more challenging but realistic scenarios, including denser small objects, frequent disappearance and reappearance events, severe occlusions, and adverse weather and lighting, pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains the standard $J$, $F$, and $J\&F$ metrics for VOS and RVOS, while MOSEv2 adopts $J\&\dot{F}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
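
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how $J$ (region similarity, the Jaccard index between a predicted and a ground-truth mask) and the combined $J\&F$ score (the mean of $J$ and $F$) are conventionally computed. It is not taken from the challenge's evaluation code. The function names `region_similarity` and `j_and_f` are hypothetical, masks are assumed to be nested lists of 0/1 values, the empty-mask convention is an assumption, and the contour score $F$ is passed in as a given number rather than computed. The MOSEv2 variant $J\&\dot{F}$ is not covered here.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels (assumed representation)


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: when both masks are empty, score the frame as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """Combined score: arithmetic mean of region (J) and contour (F) scores."""
    return (j + f) / 2.0


# Toy 2x3 example: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
print(j)                 # 0.5
print(j_and_f(j, 1.0))   # 0.75, using a made-up F of 1.0
```

In practice these per-frame scores are averaged over frames and objects before ranking; that aggregation is omitted from the sketch.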

Paper Structure

This paper contains 20 sections, 17 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: The overall framework of SCOPE. The left figure (a) illustrates our overall ensemble pipeline, while the right figure (b) shows the fusion network of SAM2 and Cutie with the proposed Motion Prediction Module (MPM).
  • Figure 2: Framework of our method for the training stage.
  • Figure 3: Framework of our method for the inference stage.
  • Figure 4: Overview of our method: pseudo-label enhanced SAM2Long training and cascaded inference with SeC.
  • Figure 5: Overview of the proposed second stage of training pipeline, where only the memory attention module is fine-tuned during this process.
  • ...and 4 more figures