VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li

TL;DR

The paper defines video referring matting, where a natural language caption specifies the target object in a video and an alpha matte is produced for each frame. It introduces VRM-10K, a large-scale, captioned video matting dataset, and VRMDiff, a diffusion-based framework that fuses video conditioning and text guidance to generate temporally coherent mattes without mask guidance. To handle overlapping instances and align with descriptions, it applies latent contrastive learning in a 3D VAE latent space via Latent-InfoNCE, integrated with the diffusion objective. Experiments on VRM-10K show substantial gains in matting quality and temporal stability over baselines, highlighting the method's practicality for caption-driven video editing.

Abstract

We propose a new task, video referring matting, which produces the alpha matte of a specified instance from a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Contrastive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that simultaneously contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
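
For readers new to matting, the alpha matte mentioned in the abstract is conventionally defined through the per-pixel compositing model below. This is the standard formulation from the matting literature, not an equation quoted from this paper, and the symbols (frame index t, pixel p, foreground F, background B) are chosen here for illustration only.

```latex
% Standard alpha compositing model (notation assumed, not taken from the paper)
I_t(p) = \alpha_t(p)\, F_t(p) + \bigl(1 - \alpha_t(p)\bigr)\, B_t(p),
\qquad \alpha_t(p) \in [0, 1]
```

Under this model, video referring matting amounts to estimating the matte for every frame of the one instance that the caption describes.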

Paper Structure

This paper contains 19 sections, 15 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Visualization results of referring matting method VRMDiff. We treat video matting as a generation task with referring capabilities. By inputting a caption that describes the instance, our model outputs the corresponding instance's matte.
  • Figure 2: Data pipeline. For the background videos, we sample from DVM, and for the foreground instances, we sample from VideoMatte240K. The foreground instances are composited onto the background videos. The instance-level captions are generated by the vision-language model Tarsier, using the matte-extracted instance videos as input.
  • Figure 3: Overview of the VRMDiff framework. Similar to CogVideoX, our method performs denoising in the latent space of a 3D VAE. We input the conditional video and the corresponding referring caption, and the model is expected to output the alpha matte, which the 3D VAE decodes into RGB space. In addition to the diffusion loss, we employ a latent contrastive loss. During training, we use the model's latent output as the anchor, the ground truth matte as the positive sample, and instance mattes not corresponding to the caption as negative samples. This approach enhances the model's ability to distinguish instances and improves text-to-instance alignment.
  • Figure 4: Single instance matting qualitative results on VRMDiff. Six frames are evenly sampled from the video, with the horizontal axis representing time and the frame index gradually increasing. From top to bottom, the sequence is the input video, the output alpha matte, and the extracted instance obtained by applying the matte to the video. The prompts are "A man in a dark suit and tie" and "The man is wearing a dark blue suit jacket, a white shirt, and black headphones."
  • Figure 5: Multiple instance matting qualitative results on VRMDiff. Six frames are evenly sampled from the video, with the horizontal axis representing time and the frame index gradually increasing. We use masks of different colors to represent the alpha mattes of different instances.
  • ...and 3 more figures
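
The anchor/positive/negative scheme described in the Figure 3 caption can be illustrated with a small InfoNCE sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: latents are treated as flattened lists of floats, similarity is cosine similarity, and the function names, `temperature`, and `weight` are hypothetical choices that do not appear in the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length flattened latent vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors


def latent_info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive pair's logit.

    anchor    -- the model's latent output (assumed flattened)
    positive  -- latent of the ground truth matte for the caption
    negatives -- latents of mattes for instances the caption does not describe
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_sum_exp - logits[0]


def total_loss(diffusion_loss, contrastive_loss, weight=0.1):
    """Combine both objectives; the additive form and weight are assumptions."""
    return diffusion_loss + weight * contrastive_loss


if __name__ == "__main__":
    anchor = [1.0, 0.0, 0.0]
    positive = [0.9, 0.1, 0.0]    # close to the anchor
    negatives = [[0.0, 1.0, 0.0]]  # a different instance's matte latent
    loss = latent_info_nce(anchor, positive, negatives)
    print(f"contrastive loss: {loss:.6f}")
    print(f"total loss: {total_loss(0.5, loss):.6f}")
```

The loss is small when the anchor sits closer to the positive than to any negative, and grows when a negative matches the anchor better, which is the behavior the caption attributes to the latent contrastive term.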