UniVG: Towards UNIfied-modal Video Generation

Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao

TL;DR

UniVG addresses the need for flexible, multi-task video generation conditioned on text and images. It introduces a unified framework with a Base model for high-freedom generation and two low-freedom branches (Image Animation and Super-Resolution), coupled with Multi-condition Cross Attention and Biased Gaussian Noise to bridge training and inference. The approach yields strong objective performance on MSR-VTT and competitive human judgments, highlighting practical impact for real-world multimodal video creation. Overall, UniVG advances unified, cross-modal video generation with a scalable architecture and principled handling of conditioning diversity.

Abstract

Diffusion-based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts mainly concentrate on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input image and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Generation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.
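
To make the idea of Biased Gaussian Noise more concrete, the sketch below contrasts a standard forward diffusion step with a biased one whose mean is pulled toward the condition latent. The abstract does not give the exact formulation, so the bias weight `1 - sqrt(alpha_bar)`, the function name `forward_noise`, and the flat-list latents are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def forward_noise(z0, cond, alpha_bar, rng, biased=True):
    """Sample a noised latent z_t from a clean latent z0.

    Standard forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps
    Assumed biased step:   z_t = sqrt(a) * z0 + (1 - sqrt(a)) * cond
                                 + sqrt(1 - a) * eps
    The bias term is a hypothetical choice, not taken from the paper.
    """
    s = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    out = []
    for x, c in zip(z0, cond):
        eps = rng.gauss(0.0, 1.0)  # unit Gaussian noise per element
        bias = (1.0 - s) * c if biased else 0.0
        out.append(s * x + bias + sigma * eps)
    return out


if __name__ == "__main__":
    z0 = [1.0, -2.0, 0.5]    # clean target latent (toy values)
    cond = [0.3, 0.3, 0.3]   # latent of the input condition (toy values)
    # At alpha_bar = 0 the biased sample is centred on cond, not on zero.
    print(forward_noise(z0, cond, 0.0, random.Random(0)))
```

Under this assumed form, the fully noised latent is centred on the condition rather than on zero, so sampling at inference could start from the condition plus noise. That is one way to read the claim that the biased noise helps preserve the content of the input conditions.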

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: UniVG is a unified video generation framework that supports various video generation tasks, such as Text-to-Video, Image-to-Video, and Text&Image-to-Video. Two sets of examples are displayed. Row 1: input text to generate semantically consistent videos; Row 2: input an image to produce pixel-aligned videos; Row 3: combine the semantics of the input text and image to create semantically aligned videos. All videos are shown at https://univg-baidu.github.io.
  • Figure 2: Overview of the proposed UniVG system. (a) displays the whole pipeline of UniVG, which includes the Base Model $\mathcal{F}_B$, the Animation model $\mathcal{F}_A$, and the Super Resolution model $\mathcal{F}_{SR}$. (b) illustrates the Multi-condition Cross Attention involved in $\mathcal{F}_B$ and $\mathcal{F}_A$.
  • Figure 3: The forward & backward diffusion process with Random Gaussian Noise and Biased Gaussian Noise.
  • Figure 4: Percentage (%) of Overall Preference of UniVG-LG generated videos compared with other SOTA methods.
  • Figure 5: FVD Scores on MSR-VTT during the Training Process of $\mathcal{F}_B$.
  • ...and 2 more figures
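
The Multi-condition Cross Attention named in Figure 2(b) can be illustrated with a small sketch. This summary does not reproduce the module's definition, so the code assumes one common design: the latent tokens act as shared queries, each condition supplies its own keys and values, and the attention outputs of the conditions that are present are summed. The names `attend` and `multi_condition_cross_attention` are hypothetical, and the learned projections are omitted (keys and values are assumed to be already projected).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def multi_condition_cross_attention(queries, conditions):
    """Sum the attention outputs of every condition that is present.

    `conditions` maps a name (e.g. "text", "image") to a (keys, values)
    pair, or to None when that condition was not supplied (assumed design).
    """
    total = None
    for pair in conditions.values():
        if pair is None:
            continue  # a missing condition contributes nothing
        out = attend(queries, pair[0], pair[1])
        if total is None:
            total = out
        else:
            total = [[a + b for a, b in zip(r, o)] for r, o in zip(total, out)]
    if total is None:
        # No condition given: return zeros shaped like the queries.
        return [[0.0] * len(q) for q in queries]
    return total


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    text = ([[1.0, 0.0]], [[2.0, 3.0]])
    image = ([[0.0, 1.0]], [[10.0, 20.0]])
    print(multi_condition_cross_attention(queries, {"text": text, "image": image}))
```

Summing per-condition outputs lets the same layer serve text-only, image-only, and combined inputs, which fits the flexible conditioning the abstract describes. Whether UniVG combines the conditions in exactly this way is not stated here.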