STSeg-Complex Video Object Segmentation: The 1st Solution for 4th PVUW MOSE Challenge

Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang

TL;DR

The STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing 1st place and advancing video object segmentation in complex scenarios.

Abstract

Segmenting video objects in complex scenarios is highly challenging, and the MOSE dataset has contributed significantly to progress in this field. This technical report details the STSeg solution proposed by the "imaplus" team. By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, STSeg shows clear advantages in handling complex object motion and long video sequences. At inference time, an Adaptive Pseudo-labels Guided Model Refinement Pipeline selects an appropriate model for each video. With the finetuned models and this pipeline, STSeg achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing 1st place and advancing video object segmentation in complex scenarios.

Paper Structure

This paper contains 17 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) shows original frames from the MOSE test set. (b) shows that SAM2 without finetuning (large2.1 checkpoint) fails to track the dragon boat. (c) shows that finetuned SAM2 effectively tracks the occluded dragon boat when it reappears.
  • Figure 2: Overview of the PGMR framework, covering inference and pseudo-label-based model selection on the MOSE test set. Five models (SAM2, TMO, Cutie, XMem, and LiVOS) run inference. Comprehensive pseudo-labels are constructed from the masks predicted by each model. Based on these pseudo-labels, the best-performing model is selected for each video's content.
  • Figure 3: Inference results of the individual models, together with our final pseudo-label masks. (a) is the original RGB image. (b) is the mask inferred by TMO. (c) is the mask inferred by LiVOS. (d) is the mask inferred by XMem. (e) is the mask inferred by Cutie. (f) is the mask inferred by SAM2. (g) is the average fusion mask, generated by converting the masks of the five models into bounding boxes and then averaging them. (h) is the max fusion mask. (i) is our pseudo-label.
  • Figure 4: (a) demonstrates the robustness of the solution, showing stable tracking of the target under extreme observation conditions. (b) shows the effectiveness of the solution under severe occlusion. (c) shows that the solution clearly distinguishes between similar objects and generates accurate masks. (d) shows that the solution accurately segments the driver from the entire vehicle.
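
The fusion and selection steps described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation, and every function name in it is hypothetical. It assumes that "average fusion" means averaging the per-model bounding-box occupancy maps and thresholding at 0.5, that "max fusion" means a pixel-wise union of the masks, and that model selection picks the model whose masks have the highest mean IoU against the pseudo-labels. The captions do not specify these details, and how the final pseudo-label in panel (i) is built is not covered here.

```python
import numpy as np

def mask_to_box(mask):
    """Return (y0, x0, y1, x1) around the foreground, or None for an empty mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

def box_to_map(box, shape):
    """Rasterize a box into a float occupancy map of the given shape."""
    out = np.zeros(shape, dtype=np.float32)
    if box is not None:
        y0, x0, y1, x1 = box
        out[y0:y1, x0:x1] = 1.0
    return out

def average_box_fusion(masks, threshold=0.5):
    """Assumed reading of 'average fusion': average the box maps, then threshold."""
    shape = masks[0].shape
    box_maps = [box_to_map(mask_to_box(m), shape) for m in masks]
    return np.mean(box_maps, axis=0) > threshold

def max_fusion(masks):
    """Assumed reading of 'max fusion': pixel-wise union of all model masks."""
    return np.any(np.stack(masks).astype(bool), axis=0)

def iou(a, b):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(np.logical_and(a, b).sum() / union)

def select_model(predictions, pseudo_labels):
    """Pick the model whose per-frame masks best agree with the pseudo-labels.

    predictions: dict mapping model name -> list of per-frame masks
    pseudo_labels: list of per-frame pseudo-label masks (same length)
    """
    scores = {
        name: float(np.mean([iou(p, q) for p, q in zip(frames, pseudo_labels)]))
        for name, frames in predictions.items()
    }
    return max(scores, key=scores.get), scores
```

In practice `select_model` would be called once per video, with one entry in `predictions` for each of the five models, so that different videos can end up with different models.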