LSVOS 2025 Challenge Report: Recent Advances in Complex Video Object Segmentation
Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu, Linjie Yang, Yuchen Fan, Mingqi Gao, Jingkun Chen, Yunqi Miao, Gengshen Wu, Zhijin Qin, Jungong Han, Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Chang Soo Lim, Joonyoung Moon, Donghyeon Cho, Tingmin Li, Yixuan Li, Yang Yang, An Yan, Leilei Cao, Feng Lu, Ran Hong, Youhai Jiang, Fengjie Zhu, Yujie Xie, Hongyang Zhang, Zhihui Liu, Shihai Ruan, Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji, Ran Hong, Feng Lu, Leilei Cao, An Yan, Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe
TL;DR
The report surveys the LSVOS 2025 Challenge. It introduces MOSEv2 as a significantly more challenging complex VOS benchmark and details the three tracks (MOSEv2, VOS, and RVOS) along with their evaluation metrics, datasets, and public benchmarking setup. It highlights top-performing solutions across tracks, including memory-guided approaches (Long-term Grounding Memory and Concept-aware Memory), LVLM-enhanced segmentation pipelines, and confidence-guided ensemble methods. A cross-track analysis reveals the growing importance of LLM/MLLM components and memory-aware propagation for robust, long-term segmentation under occlusion, clutter, and appearance changes, as well as the need for language-grounded, interactive capabilities. The findings indicate substantial headroom on MOSEv2 and point to future research in memory mechanisms, semantic grounding, and cross-modal integration for resilient, language-aware video understanding in the wild. Overall, MOSEv2 pushes models toward robust, long-horizon, language-informed VOS, while the traditional VOS and RVOS tracks reinforce the value of model ensembles and memory-aware strategies in real-world scenes.
Abstract
This report presents an overview of the 7th Large-scale Video Object Segmentation (LSVOS) Challenge, held in conjunction with ICCV 2025. Alongside the two traditional LSVOS tracks, Classic VOS (VOS) and Referring VOS (RVOS), which jointly target robustness in realistic video scenarios, the 2025 edition introduces a new track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty by introducing more challenging yet realistic scenarios, including denser small objects, frequent disappearance and reappearance events, severe occlusions, and adverse weather and lighting, pushing long-term consistency and generalization beyond curated benchmarks. The challenge retains the standard $J$, $F$, and $J\&F$ metrics for VOS and RVOS, while MOSEv2 adopts $J\&\dot{F}$ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top-performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory-aware propagation, aiming to chart future directions for resilient, language-aware video segmentation in the wild.
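As a rough illustration of the standard metrics named above, the sketch below computes region similarity $J$ as the Jaccard index (intersection over union) of two binary masks, and $J\&F$ as the mean of $J$ and $F$. It is not the challenge's evaluation code. The function names `region_similarity` and `j_and_f` are hypothetical, the contour accuracy $F$ is assumed to be computed elsewhere and passed in as a number, and scoring an empty prediction against an empty ground truth as 1.0 is an assumed convention. The MOSEv2 variant $\dot{F}$ is not reproduced here.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (illustrative representation).
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (IoU) between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: if the object is absent in both masks,
    # the prediction is counted as fully correct.
    if union == 0:
        return 1.0
    return inter / union


def j_and_f(j: float, f: float) -> float:
    """Combined score: the mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    j = region_similarity(pred, gt)
    # F is a placeholder value here; a real evaluation would compute it
    # from mask boundaries.
    print(j, j_and_f(j, 1.0))
```

In practice these per-frame scores would be averaged over the frames and objects of each video. The empty-mask branch matters for sequences in which the target disappears, which the abstract identifies as a frequent case in MOSEv2.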
