PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

Henghui Ding; Chang Liu; Nikhila Ravi; Shuting He; Yunchao Wei; Song Bai; Philip Torr; Kehuan Song; Xinglin Xie; Kexin Zhang; Licheng Jiao; Lingling Li; Shuyuan Yang; Xuqiang Cao; Linnan Zhao; Jiaxuan Zhao; Fang Liu; Mengjiao Wang; Junpei Zhang; Xu Liu; Yuting Yang; Mengru Ma; Hao Fang; Runmin Cong; Xiankai Lu; Zhiyang Chen; Wei Zhang; Tianming Liang; Haichao Jiang; Wei-Shi Zheng; Jian-Fang Hu; Haobo Yuan; Xiangtai Li; Tao Zhang; Lu Qi; Ming-Hsuan Yang

TL;DR

This paper presents the PVUW 2025 Challenge Report, detailing two tracks (MOSE for complex video object segmentation and MeViS for motion-guided, language-based segmentation) along with their accompanying datasets and evaluation framework. It highlights the top-performing solutions: BrainyBots, DeepSegMa, and JIO for MOSE, and MVP-Lab, ReferDINO-iSEE, and Sa2VA for MeViS. Each leverages large pretrained and multimodal models, data augmentation, and inference-time strategies to improve temporal coherence and cross-modal reasoning. A key theme is the growing role of LLMs and SAM-2-like segmentation in achieving robust, pixel-precise video understanding in unconstrained environments, supported by adaptive fusion, memory, and prompt-based techniques. The report emphasizes data quality, scalable model design, and ongoing dataset updates as essential levers for advancing real-world video understanding.

Abstract

This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state of the art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.
