Semantic Segmentation on VSPW Dataset through Masked Video Consistency

Chen Liang; Qiang Guo; Chongkai Yu; Chengjing Wu; Ting Liu; Luoqi Liu

TL;DR

This work tackles pixel-level video semantic segmentation on the VSPW dataset by enforcing spatiotemporal coherence through Masked Video Consistency (MVC), a plug-in to the DVIS++ baseline. The approach adds test-time augmentation, model aggregation based on optical-flow–driven temporal consistency, and multimodal post-processing to correct challenging cases. It achieves $67.27\%$ mIoU on the VSPW test set and strong Video Consistency scores, placing 2nd in the PVUW2024 VSS track; the MVC variant also reaches $VC_8 \approx 95\%$ and $VC_{16} \approx 93\%$. These results indicate improved boundary precision and more stable segmentation across frames, which matters for real-world video understanding tasks.

Abstract

Pixel-level video understanding requires integrating three-dimensional data across the spatial and temporal dimensions to learn accurate and stable semantic information from continuous frames. However, existing advanced models on the VSPW dataset do not fully model spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, which adds masked video consistency (MVC) to existing models. MVC enforces consistency between predictions on masked frames, in which random patches are withheld. The model must infer the segmentation of the masked regions from the image context and from the relationship between preceding and succeeding frames of the video. Additionally, we employ test-time augmentation, model aggregation, and a multimodal model-based post-processing method. Our approach achieves 67.27% mIoU on the VSPW dataset, ranking 2nd in the VSS track of the PVUW2024 challenge.
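To make the masking idea concrete, the following is a minimal sketch of withholding random patches from each frame of a clip and measuring how much two predictions disagree. It is an illustration under stated assumptions, not the authors' implementation: the function names (`random_patch_mask`, `mask_clip`, `disagreement`), the patch size, the mask ratio, and the use of a simple label-disagreement rate in place of the actual training loss are all hypothetical choices made here.

```python
import random


def random_patch_mask(height, width, patch, ratio, rng):
    """Return an H x W mask of 0/1 values; 0 marks pixels in withheld patches.

    Each patch-aligned block is withheld independently with probability
    `ratio` (an assumed sampling scheme, not taken from the paper).
    """
    mask = [[1] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < ratio:
                for y in range(top, min(top + patch, height)):
                    for x in range(left, min(left + patch, width)):
                        mask[y][x] = 0
    return mask


def apply_mask(frame, mask):
    """Zero out the withheld pixels of a single-channel frame."""
    return [[p * m for p, m in zip(row, mrow)] for row, mrow in zip(frame, mask)]


def mask_clip(frames, patch=2, ratio=0.5, seed=0):
    """Mask every frame of a clip with its own independently drawn mask."""
    rng = random.Random(seed)
    height, width = len(frames[0]), len(frames[0][0])
    masks = [random_patch_mask(height, width, patch, ratio, rng) for _ in frames]
    masked = [apply_mask(f, m) for f, m in zip(frames, masks)]
    return masked, masks


def disagreement(pred_a, pred_b):
    """Fraction of pixels whose labels differ between two label maps.

    A stand-in for a consistency loss: 0.0 means the predictions agree.
    """
    total = sum(len(row) for row in pred_a)
    diff = sum(a != b for ra, rb in zip(pred_a, pred_b) for a, b in zip(ra, rb))
    return diff / total
```

In a training loop built on this sketch, a segmentation model would predict labels for the masked clip, and `disagreement` (or a differentiable loss playing the same role) would compare those predictions against reference predictions, so that the model is pushed to fill in withheld regions from spatial context and neighboring frames.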

Paper Structure (18 sections, 8 equations, 6 tables)

Introduction
The proposed method
Baseline Model
Masked Video Consistency
Model aggregation
Test-time augmentation
Post-processing
Experiments
Datasets and evaluation metrics
Implementation details
Ablation studies
Ablation study of MVC
Ablation study of extra training data
Ablation study of test-time augmentation
Ablation study of model aggregation
...and 3 more sections
