Object-Aware Video Matting with Cross-Frame Guidance
Huayu Zhang, Dongyue Wu, Yuanjie Shao, Nong Sang, Changxin Gao
TL;DR
This paper tackles trimap-free video matting by giving the network explicit object understanding as it processes video frames. It introduces OAVM, which combines pixel-level temporal features with object-level queries through an Object-Guided Correction and Refinement (OGCR) module and a Sequential Foreground Merging augmentation. A memory bank stores past frame embeddings, and cross-frame guidance uses the previous frame's mask $S_{k-1}$ to steer pixel features toward foreground objects via $q_o$ and $q_{fb}$. Experiments on the RVM, VMFormer, and CRGNN benchmarks show state-of-the-art performance with only an initial coarse mask, demonstrating strong object-aware matting and robust real-world generalization.
Abstract
Recently, trimap-free methods have drawn increasing attention in human video matting due to their promising performance. Nevertheless, these methods still lack deterministic foreground-background cues, which impairs their ability to consistently identify and locate foreground targets over time and to mine fine-grained details. In this paper, we present a trimap-free Object-Aware Video Matting (OAVM) framework that can perceive different objects, enabling joint recognition of foreground objects and refinement of edge details. Specifically, we propose an Object-Guided Correction and Refinement (OGCR) module, which employs cross-frame guidance to aggregate object-level instance information into pixel-level detail features, thereby promoting their synergy. Furthermore, we design a Sequential Foreground Merging augmentation strategy to diversify sequential scenarios and enhance the network's capacity for object discrimination. Extensive experiments on recent, widely used synthetic and real-world benchmarks demonstrate the state-of-the-art performance of OAVM with only an initial coarse mask. The code and model will be made available.
