Object-Aware Video Matting with Cross-Frame Guidance

Huayu Zhang, Dongyue Wu, Yuanjie Shao, Nong Sang, Changxin Gao

TL;DR

This paper tackles trimap-free video matting by giving the network explicit object understanding while it processes video frames. It introduces OAVM, which combines pixel-level temporal features with object-level queries through an Object-Guided Correction and Refinement (OGCR) module, together with a Sequential Foreground Merging augmentation. A memory bank stores past frame embeddings, and cross-frame guidance uses the previous frame mask $S_{k-1}$ to steer pixel features toward foreground objects via the queries $q_o$ and $q_{fb}$. Experiments on RVM, VMFormer, and CRGNN show state-of-the-art performance with only an initial coarse mask, demonstrating strong object-aware matting and robust real-world generalization.
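
To make the idea of cross-frame guidance more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single object-level query attending over flattened pixel features, with attention logits suppressed wherever the previous frame mask $S_{k-1}$ marks background. The function name `mask_guided_attention`, the additive penalty `bg_penalty`, and the toy dimensions are all hypothetical; the actual OGCR module may combine $q_o$ and $q_{fb}$ quite differently.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_guided_attention(query, pixel_feats, prev_mask, bg_penalty=1e4):
    """Hypothetical helper: one query attends to pixels, biased by the previous mask.

    query:       list of d floats (stand-in for an object-level query)
    pixel_feats: list of N pixel features, each a list of d floats
    prev_mask:   list of N values in [0, 1]; 1 means foreground in frame k-1
    """
    d = len(query)
    scale = 1.0 / math.sqrt(d)
    logits = []
    for feat, m in zip(pixel_feats, prev_mask):
        score = scale * sum(q * f for q, f in zip(query, feat))
        # Assumption: background positions receive a large negative bias,
        # so the softmax concentrates on previously-foreground pixels.
        logits.append(score - bg_penalty * (1.0 - m))
    weights = softmax(logits)
    # Attended feature: attention-weighted sum of the pixel features.
    attended = [sum(w * feat[j] for w, feat in zip(weights, pixel_feats))
                for j in range(d)]
    return weights, attended

# Toy example: four pixels with 2-D features; the last pixel was background.
query = [1.0, 0.0]
pixel_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
prev_mask = [1.0, 1.0, 1.0, 0.0]
weights, attended = mask_guided_attention(query, pixel_feats, prev_mask)
```

In this toy case the fourth pixel matches the query as well as the first two, yet it receives essentially zero attention because the previous mask labels it as background. A soft bias rather than a hard cut is used here so that a coarse or slightly wrong mask degrades attention gradually; that choice is also an assumption.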

Abstract

Recently, trimap-free methods have drawn increasing attention in human video matting due to their promising performance. Nevertheless, these methods still suffer from a lack of deterministic foreground-background cues, which impairs their ability to consistently identify and locate foreground targets over time and to mine fine-grained details. In this paper, we present a trimap-free Object-Aware Video Matting (OAVM) framework, which can perceive different objects, enabling joint recognition of foreground objects and refinement of edge details. Specifically, we propose an Object-Guided Correction and Refinement (OGCR) module, which employs cross-frame guidance to aggregate object-level instance information into pixel-level detail features, thereby promoting their synergy. Furthermore, we design a Sequential Foreground Merging augmentation strategy to diversify sequential scenarios and enhance the network's capacity for object discrimination. Extensive experiments on recent, widely used synthetic and real-world benchmarks demonstrate the state-of-the-art performance of our OAVM with only an initial coarse mask. The code and model will be made available.
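
The abstract does not spell out how Sequential Foreground Merging is implemented. As a rough illustration only, the sketch below shows the standard compositing relation $I = \alpha F + (1 - \alpha) B$ that matting augmentations generally build on, applied frame by frame to two foreground sequences layered with the "over" operator. The function names `over` and `merge_sequences`, the grayscale per-pixel representation, and the choice to treat the union of both foregrounds as the merged alpha are assumptions, not details from the paper.

```python
def over(alpha_top, color_top, alpha_bot, color_bot):
    """Layer one pixel over another (non-premultiplied colors)."""
    alpha = alpha_top + (1.0 - alpha_top) * alpha_bot
    if alpha == 0.0:
        return 0.0, 0.0  # fully transparent: color is irrelevant
    color = (alpha_top * color_top
             + (1.0 - alpha_top) * alpha_bot * color_bot) / alpha
    return alpha, color

def merge_sequences(seq_a, seq_b, background):
    """Hypothetical augmentation: merge two foreground clips onto a background clip.

    seq_a, seq_b: per-frame lists of (alpha, color) pixels; seq_a is on top
    background:   per-frame lists of background colors
    Returns a list of (composited_frame, merged_alpha) pairs, one per frame.
    """
    merged = []
    for frame_a, frame_b, frame_bg in zip(seq_a, seq_b, background):
        image, alphas = [], []
        for (a_a, c_a), (a_b, c_b), c_bg in zip(frame_a, frame_b, frame_bg):
            alpha, color = over(a_a, c_a, a_b, c_b)
            # Standard matting equation: I = alpha * F + (1 - alpha) * B.
            image.append(alpha * color + (1.0 - alpha) * c_bg)
            alphas.append(alpha)
        merged.append((image, alphas))
    return merged

# Toy example: one frame with two grayscale pixels.
seq_a = [[(1.0, 0.8), (0.0, 0.0)]]
seq_b = [[(0.5, 0.2), (0.5, 0.2)]]
background = [[0.0, 1.0]]
result = merge_sequences(seq_a, seq_b, background)
image, alphas = result[0]
```

Here the first pixel is fully covered by the top foreground, while the second pixel shows the half-transparent lower foreground blended with the background. Whether the second foreground should count as a target or as a distractor during training is a design decision the abstract leaves open.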

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: We explore endowing the network with the ability to jointly recognize foreground targets and refine edge details without trimaps. As illustrated in the example, our method consistently delivers outstanding performance across challenging video cases, whether it is with similar foreground and background colors or very detailed hair.
  • Figure 2: (a) The overview of the proposed OAVM framework. Pixel-level Feature Extraction (PFE) and Object-level Query Generation (OQG) are implemented to yield features with temporal coherence and queries embedding object-level instances, respectively. Subsequently, the two are integrated by the Object-Guided Correction and Refinement (OGCR) module, where the detailed features are further refined, ultimately producing the alpha matte and foreground mask of the current frame. The instance mask output is used to optimize the network during the training phase. (b) Illustration of the Object-Guided Correction and Refinement (OGCR) module.
  • Figure 3: Qualitative comparisons on the RVM benchmark. Our OAVM yields excellent results in both edge detail and center regions. Please zoom in for a better view.
  • Figure 4: Visual comparisons on real-world scenarios. Frames in the first two rows originate from the CRGNN benchmark, and those in the subsequent two rows are sourced from the DAVIS and MOSE datasets. Please zoom in for a better view.
  • Figure 5: Visualization of attention maps in the OGCR module. Top: without cross-frame guidance. Bottom: with cross-frame guidance. With cross-frame guidance, the model can accurately localize foreground objects and handle multi-object scenes.
  • ...and 1 more figure