VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
TL;DR
The paper defines video referring matting, where a natural language caption specifies the target object in a video and an alpha matte is produced for each frame. It introduces VRM-10K, a large-scale, captioned video matting dataset, and VRMDiff, a diffusion-based framework that fuses video conditioning and text guidance to generate temporally coherent mattes without mask guidance. To handle overlapping instances and align with descriptions, it applies latent contrastive learning in a 3D VAE latent space via Latent-InfoNCE, integrated with the diffusion objective. Experiments on VRM-10K show substantial gains in matting quality and temporal stability over baselines, highlighting the method's practicality for caption-driven video editing.
Abstract
We propose a new task, video referring matting, which takes a referring caption as input and produces the alpha matte of the specified instance. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that simultaneously contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
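To make the idea of a contrastive term over latents concrete, the following is a minimal sketch of a generic InfoNCE loss added to a diffusion loss. It is not the authors' formulation, which this section does not spell out: the function names (`cosine`, `info_nce`, `combined_loss`), the use of cosine similarity, the temperature, and the weighting factor are all assumptions chosen for illustration, and latents are represented as plain flattened vectors rather than 3D VAE latents.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: negative log-softmax of the positive pair's similarity.

    anchor    -- flattened latent of the predicted matte (hypothetical)
    positive  -- flattened latent of the referred instance's matte
    negatives -- flattened latents of other instances in the same video
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def combined_loss(diffusion_loss, contrastive_loss, weight=0.1):
    """Weighted sum of the two terms; the weight is an assumed hyperparameter."""
    return diffusion_loss + weight * contrastive_loss

if __name__ == "__main__":
    anchor = [1.0, 0.0, 0.0, 0.0]
    target = [0.9, 0.1, 0.0, 0.0]   # latent close to the anchor
    other = [0.0, 1.0, 0.0, 0.0]    # latent of a different instance
    matched = info_nce(anchor, target, [other])
    mismatched = info_nce(anchor, other, [target])
    print(matched < mismatched)
```

In this toy setup the loss is lower when the anchor is paired with the latent of the referred instance than when it is paired with a different instance, which is the behavior a contrastive term is meant to encourage when instances overlap.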
