PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild

Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang, Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu, Haobo Yuan, Xiangtai Li, Tao Zhang, Lu Qi, Ming-Hsuan Yang

TL;DR

This paper presents the PVUW 2025 Challenge Report, which details two tracks, MOSE for complex video object segmentation and MeViS for motion-guided language-based segmentation, together with their datasets and evaluation framework. It highlights the top-performing solutions (BrainyBots, DeepSegMa, and JIO for MOSE; MVP-Lab, ReferDINO-iSEE, and Sa2VA for MeViS), each of which leverages large pretrained and multimodal models, data augmentation, and inference-time strategies to improve temporal coherence and cross-modal reasoning. A key theme is the growing role of LLMs and SAM-2-like segmentation models in achieving robust, pixel-precise video understanding in unconstrained environments, supported by adaptive fusion, memory, and prompt-based techniques. The report identifies data quality, scalable model design, and ongoing dataset updates as the main levers for advancing real-world video understanding.

Abstract

This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real-world scenarios. Through detailed evaluation and analysis, the challenge offers insights into the current state of the art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the PGMR Framework. Inference and Pseudo-Label-Based Model Selection: five models run inference, and the best-performing model is selected for each video according to its content.
  • Figure 2: Overview of Team DeepSegMa's method.
  • Figure 3: Network Architecture of FVOS.
  • Figure 4: Test-time data augmentation and multi-scale magnification operations. (a) Original image. (b) Rotated clockwise by 90$^\circ$. (c) Rotated clockwise by 180$^\circ$. (d) Rotated clockwise by 270$^\circ$. (e) Horizontal flipping. (f) Multi-scale magnification.
  • Figure 5: The architecture of Sa2VA [yuan2025sa2va]. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed through a large language model (LLM). The output text tokens are used to generate the "[SEG]" token and associated language outputs. The SAM 2 decoder receives the image and video features from the SAM 2 encoder, along with the "[SEG]" token, to generate corresponding image and video masks.
  • ...and 1 more figure
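
The test-time augmentations named in the Figure 4 caption (clockwise rotations, horizontal flipping, magnification) can be illustrated with a minimal NumPy sketch. This is not the participating team's implementation: the function names, the `predict` callable (assumed to map an image to a per-pixel probability map of the same height and width), the integer magnification factor, and the plain averaging of predictions are all assumptions made for illustration.

```python
import numpy as np

def tta_views(image):
    """Return (transformed image, inverse function) pairs for each view."""
    views = [(image, lambda m: m)]  # original image
    for k in (1, 2, 3):
        # np.rot90 with a negative count rotates clockwise;
        # the inverse rotates the predicted mask back counterclockwise.
        views.append((np.rot90(image, -k), lambda m, k=k: np.rot90(m, k)))
    # Horizontal flip is its own inverse.
    views.append((image[:, ::-1], lambda m: m[:, ::-1]))
    return views

def magnify(image, factor=2):
    """Nearest-neighbour upscaling by an integer factor (hypothetical choice)."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def shrink(mask, factor=2):
    """Undo magnify() on a 2-D mask by averaging each factor x factor block."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def predict_with_tta(image, predict, factor=2):
    """Average predictions over all views, mapped back to the original frame."""
    probs = [inverse(predict(view)) for view, inverse in tta_views(image)]
    probs.append(shrink(predict(magnify(image, factor)), factor))
    return np.mean(probs, axis=0)
```

Each prediction is mapped back to the original orientation and resolution before averaging, so the views can be combined pixel by pixel. A real system might weight the views or fuse them differently.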