Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu

TL;DR

This work tackles the instability and limited generalization of video segmentation by introducing Masked Video Consistency (MVC) and Object Masked Attention (OMA) within a decoupled video segmentation framework. MVC applies strategic masking to inputs and features to compel the model to predict full semantic segmentation using broader spatial-temporal context, while OMA modulates cross-attention to downweight irrelevant background queries and strengthen temporal modeling. Across five datasets and three tasks (VPS, VSS, VIS), the approach achieves state-of-the-art results without increasing model parameters, highlighting the value of auxiliary masking cues in supervised training. The findings advance practical video segmentation by improving frame-to-frame consistency and segmentation accuracy in challenging scenarios such as occlusions and class imbalance, with broad implications for real-world applications in video understanding.

Abstract

Video segmentation aims to partition video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets, leading to inconsistent segmentation results across frames. To address these issues, we propose Masked Video Consistency (MVC), a training strategy that enhances spatial and temporal feature aggregation. MVC randomly masks image patches and compels the network to predict the entire semantic segmentation, thereby improving the integration of contextual information. Additionally, we introduce Object Masked Attention (OMA), which optimizes the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.
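
To make the patch-masking idea concrete, the following is a minimal sketch of random patch masking on a single frame. It is not the authors' implementation: the function names (`random_patch_mask`, `apply_mask`), the patch size, and the mask ratio are all hypothetical choices for illustration, and the abstract does not specify how masked patches are filled (zeroing is assumed here).

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, rng):
    """Return a height x width grid of 0/1 flags; 0 marks a masked pixel.

    Hypothetical helper: patch_size and mask_ratio are illustrative
    parameters, not values taken from the paper.
    """
    rows = (height + patch_size - 1) // patch_size
    cols = (width + patch_size - 1) // patch_size
    num_patches = rows * cols
    num_masked = round(mask_ratio * num_patches)
    # Pick which patches to hide, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))

    mask = [[1] * width for _ in range(height)]
    for idx in masked:
        r, c = divmod(idx, cols)
        for y in range(r * patch_size, min((r + 1) * patch_size, height)):
            for x in range(c * patch_size, min((c + 1) * patch_size, width)):
                mask[y][x] = 0
    return mask


def apply_mask(frame, mask):
    """Zero out masked pixels of a single-channel frame (assumed fill value)."""
    return [[p * m for p, m in zip(frow, mrow)]
            for frow, mrow in zip(frame, mask)]


rng = random.Random(0)
frame = [[1.0] * 8 for _ in range(8)]  # toy 8x8 single-channel frame
mask = random_patch_mask(8, 8, patch_size=4, mask_ratio=0.5, rng=rng)
masked_frame = apply_mask(frame, mask)
# The training target would remain the full, unmasked segmentation,
# so the network has to infer hidden regions from surrounding context.
```

In this toy setup, two of the four 4x4 patches are hidden, so the model would see half of the frame while still being supervised on the whole segmentation map.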

Paper Structure (34 sections, 11 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Our method outperforms other methods on the VPS, VSS, and VIS tasks, validated on five different datasets. None of the improvements add model parameters.
  • Figure 2: The overall decoupled video segmentation framework equipped with Masked Video Consistency (MVC) and Object Masked Attention (OMA). Patch masking and query masking use colored blocks to indicate elements retained after element-wise masking, while uncolored areas indicate removal. In the object query diagrams, the colored cubes represent foreground object queries, and the uncolored dashed cubes represent background object queries.
  • Figure 3: Visualization of the different masking strategies for MVC-1 and MVC-2. The red blocks represent the patches of the image that are masked out.
  • Figure 4: The mechanism of Object Masked Attention. $Q_{Seg}$ represents the query output of the Segmenter. $Q_{RT}$ represents the query output of the Tracker. Different colors indicate different objects, while dashed shapes represent background object queries.
  • Figure 5: Qualitative comparison between the baseline model and our model. The white dashed box marks the segmentation problem in the baseline model's output.
  • ...and 6 more figures
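
As a rough illustration of the idea behind Object Masked Attention (Figure 4), the sketch below runs scaled dot-product cross-attention while excluding entries flagged as background. This is an assumption-laden toy, not the paper's formulation: treating $Q_{RT}$ as the attention queries and $Q_{Seg}$ as keys and values, the `keep` flags, and the function name `masked_cross_attention` are all hypothetical, and the paper may weight rather than hard-mask background queries.

```python
import math


def masked_cross_attention(queries, keys, values, keep):
    """Scaled dot-product attention that ignores keys whose keep flag is False.

    Hypothetical sketch: `keep` marks foreground object queries; background
    entries receive a score of -inf and therefore zero attention weight.
    """
    if not any(keep):
        raise ValueError("at least one key must be kept")
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if flag else -math.inf
            for k, flag in zip(keys, keep)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: one tracker query attends over three segmenter queries,
# the last of which is treated as background (assumed labeling).
q_rt = [[1.0, 0.0]]
q_seg = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
keep = [True, True, False]
out = masked_cross_attention(q_rt, q_seg, q_seg, keep)
```

Because the third entry is masked, its large values never reach the output; the result is a convex combination of the two foreground entries only.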