Semantic Segmentation on VSPW Dataset through Masked Video Consistency

Chen Liang, Qiang Guo, Chongkai Yu, Chengjing Wu, Ting Liu, Luoqi Liu

TL;DR

This work tackles pixel-level video segmentation on the VSPW dataset by enforcing spatiotemporal coherence through Masked Video Consistency (MVC), a plug-in to the DVIS++ baseline. The approach is augmented with test-time augmentation, model aggregation based on optical-flow–driven temporal consistency, and multimodal post-processing to correct challenging cases. Empirically, it achieves $67.27\%$ mIoU on the VSPW test set and strong Video Consistency scores, earning 2nd place in PVUW2024 VSS; the MVC variant also reaches $VC_8 \approx 95\%$ and $VC_{16} \approx 93\%$. These results demonstrate improved boundary precision and stable segmentation across frames, with practical impact for real-world video understanding tasks using diverse inputs and cross-domain post-processing.

Abstract

Pixel-level video understanding requires integrating three-dimensional data across the spatial and temporal dimensions to learn accurate and stable semantic information from consecutive frames. However, existing state-of-the-art models on the VSPW dataset do not fully model spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, in which we add masked video consistency (MVC) to existing models. MVC enforces consistency between predictions on masked frames, from which random patches are withheld. The model must infer the segmentation of the masked regions from the image context and from the relationship between preceding and succeeding frames of the video. Additionally, we employ test-time augmentation, model aggregation, and a multimodal model-based post-processing method. Our approach achieves 67.27% mIoU on the VSPW dataset, ranking 2nd in the VSS track of the PVUW2024 challenge.
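To make the masking idea concrete, the following is a minimal sketch of how random patches might be withheld from the frames of a clip, together with one possible consistency term. It is illustrative only: the function names, the patch size, the mask ratio, the choice of an independent mask per frame, and the use of cross-entropy against pseudo-labels are all assumptions, not details taken from the abstract.

```python
import math
import random


def random_patch_mask(height, width, patch, mask_ratio, rng):
    """Return an H x W grid of 0/1 values, where 0 marks a withheld pixel.

    Assumption: each patch x patch block is withheld independently
    with probability mask_ratio.
    """
    rows = (height + patch - 1) // patch
    cols = (width + patch - 1) // patch
    keep = [[0 if rng.random() < mask_ratio else 1 for _ in range(cols)]
            for _ in range(rows)]
    # Expand the patch-level decisions to pixel resolution.
    return [[keep[y // patch][x // patch] for x in range(width)]
            for y in range(height)]


def mask_clip(clip, patch=2, mask_ratio=0.5, seed=0):
    """Withhold random patches from every frame of a clip.

    clip is a list of frames; each frame is a list of rows of pixel values.
    Assumption: a fresh mask is drawn for each frame.
    """
    rng = random.Random(seed)
    masked = []
    for frame in clip:
        h, w = len(frame), len(frame[0])
        m = random_patch_mask(h, w, patch, mask_ratio, rng)
        masked.append([[frame[y][x] * m[y][x] for x in range(w)]
                       for y in range(h)])
    return masked


def consistency_loss(masked_probs, pseudo_labels):
    """Mean cross-entropy between predictions on masked input and pseudo-labels.

    masked_probs: one list of class probabilities per pixel.
    pseudo_labels: one class index per pixel (hypothetically taken from a
    prediction on the unmasked clip).
    """
    total = -sum(math.log(p[c]) for p, c in zip(masked_probs, pseudo_labels))
    return total / len(pseudo_labels)
```

In a training loop, a segmentation model would be run on the output of `mask_clip`, and a term such as `consistency_loss` would penalize disagreement with the reference labels, so that the withheld regions have to be inferred from the surrounding spatial and temporal context.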

Paper Structure (18 sections, 8 equations, 6 tables)