
Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge

Jinrong Zhang, Canyang Wu, Xusheng He, Weili Guan, Jianlong Wu, Liqiang Nie

Abstract

In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.

Paper Structure

This paper contains 11 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Visualization of the three target types in the MOSE v2 dataset, highlighting the challenges posed by tiny targets and semantic-dominated targets in complex environments. Tiny targets are visually minuscule and easily overwhelmed by background interference, while semantic-dominated targets possess highly similar visual features that can be easily confused with intra-class instances.
  • Figure 2: The architecture of the proposed TEP framework, which consists of three main stages: Target Classification, Tracking Enhancement, and Prompt Fusion. The Target Classification stage categorizes video targets into Regular, Tiny, and Semantic-Dominated types using mask area calculations and MLLM. The Tracking Enhancement stage employs tracking methods for Tiny Targets and MLLM for Semantic-Dominated Targets to generate bounding box prompts. The Prompt Fusion stage dynamically integrates these prompts into the SAM3 pipeline based on IoU and confidence scores to enhance segmentation accuracy and stability.
  • Figure 3: Qualitative visualization of segmentation results on the MOSE v2 test set. The visualizations demonstrate that the tracking and segmentation contours for both tiny and semantic-dominated targets remain clean, tight, and temporally stable, validating the efficacy of our prompt-enhancement strategy.
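The Prompt Fusion stage described in Figure 2 — dynamically choosing between SAM3's own prediction and the externally generated bounding-box prompt based on IoU and confidence scores — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the specific decision rule, and the `iou_thresh`/`conf_thresh` values are all assumptions for exposition.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_prompt(sam_box, sam_conf, track_box, iou_thresh=0.5, conf_thresh=0.6):
    """Pick the box prompt to feed back into the segmentation pipeline.

    Hypothetical rule: if the segmenter's own prediction is confident and
    agrees with the external tracker (high IoU), keep it; otherwise fall
    back to the tracking-enhanced box prompt.
    """
    if sam_conf >= conf_thresh and iou(sam_box, track_box) >= iou_thresh:
        return sam_box
    return track_box
```

Under this rule, the external tracker only overrides the segmenter when the two disagree or the segmenter is unsure, which matches the paper's goal of stabilizing tracking for tiny and semantic-dominated targets without disturbing SAM3's behavior on regular ones.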