Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Zixin Zhu, Xuelu Feng, Dongdong Chen, Junsong Yuan, Chunming Qiao, Gang Hua

TL;DR

Referring video object segmentation requires temporally coherent segmentation guided by natural language prompts. The authors propose VD-IT, a framework built on a fixed pretrained text-to-video diffusion model (ModelScopeT2V), employing Text-Guided Image Projection and a video-specific Noise Prediction module to extract diffusion features, paired with a deformable transformer-based mask head and Hungarian matching for segmentation. Ablation and cross-dataset experiments show that combining referring text with image tokens provides richer, temporally consistent features, yielding competitive or superior results on four standard R-VOS benchmarks. This work demonstrates that latent representations from generative T2V priors can rival discriminatively trained backbones, highlighting a new direction for unifying generative priors with discriminative tasks in video understanding.

Abstract

In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. We validate this hypothesis on the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed "VD-IT", with purpose-built components on top of a fixed pretrained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. In addition, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise prediction module, which helps preserve feature fidelity and elevate segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pretrained with discriminative image/video pretext tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code is available at https://github.com/buxiangzhiren/VD-IT.
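
To make the noise-prediction idea in the abstract concrete, the following is a rough sketch in standard DDPM-style notation. It is not the paper's own formulation, which does not appear in this section. The symbols $z_0$ (clean video latent), $\bar{\alpha}_t$ (cumulative noise schedule), and $f_\phi$ (a hypothetical name for the noise prediction module) are assumptions introduced for illustration, as is the choice of $z_0$ as the module's input.

```latex
% Standard forward diffusion: the clean latent z_0 is perturbed with sampled Gaussian noise.
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Video-specific variant (assumed form): the sampled noise is replaced by a
% prediction computed from the clip itself; f_\phi is a hypothetical name.
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon},
\qquad \hat{\epsilon} = f_\phi(z_0)
```

Under this reading, the only change from the usual noising step is the source of the noise term, which is consistent with the abstract's claim that predicted noise helps preserve feature fidelity.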

Paper Structure (13 sections, 5 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Left: Analysis of learned features of existing methods that use discriminative backbone (Video Swin Transformer) and our methods (VD-I and VD-IT) that use fixed pretrained generative T2V model. Right: Temporal inconsistency in visual features will subsequently cause temporally inconsistent masks.
  • Figure 2: The framework comprises two core components: visual feature extraction and the mask segmentation head. Feature extraction proceeds in three phases: (1) a prompt is generated via text-guided image projection to condition the text-to-video diffusion model; (2) predicted noise is applied to the video; (3) the noisy video and the prompt are processed by a diffusion U-Net for visual feature extraction. The segmentation head generates instance queries from the text and merges them with the U-Net's features to create the final masks.
  • Figure 3: "VD-I" denotes Image-conditioned Video Diffusion based on the video clip. "VD-T" denotes Text-conditioned Video Diffusion based on the referring text, i.e., expression. "VD-IT" denotes Image-Text-conditioned Video Diffusion based on both video clip and referring text.
  • Figure 4: Temporal Semantic Consistency. The cosine similarity between the Region of Interest (RoI) features of the initial frame and those of the following seven frames is reported, averaged over 1,000 samples from Ref-YouTube-VOS.
  • Figure 5: Robustness against lighting noise. We randomly modify the brightness of individual frames and compare the IoU of the segmentation results under the changing lighting conditions. The results are reported on Ref-YouTube-VOS.
  • ...and 2 more figures
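
The temporal consistency measurement described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `cosine_similarity`, `temporal_consistency`, and `mean_curve` are hypothetical, and each RoI feature is assumed to have already been pooled into a flat vector of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate (all-zero) features
    return dot / (norm_a * norm_b)


def temporal_consistency(roi_features):
    """Similarity of the first frame's RoI feature to each later frame.

    roi_features: list of per-frame feature vectors for one clip
    (assumed layout: index 0 is the initial frame).
    """
    if len(roi_features) < 2:
        raise ValueError("need at least two frames")
    first = roi_features[0]
    return [cosine_similarity(first, f) for f in roi_features[1:]]


def mean_curve(curves):
    """Average several per-clip similarity curves position by position."""
    if not curves:
        raise ValueError("need at least one curve")
    n = len(curves)
    return [sum(vals) / n for vals in zip(*curves)]


# Toy clip with four frames and 2-D features (illustrative values only).
clip = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
print(temporal_consistency(clip))
```

For the setting in the caption, each clip would contribute a curve of seven values (frames 2 through 8 against frame 1), and `mean_curve` would average those curves over the sampled clips.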