Dual-Stream Diffusion Net for Text-to-Video Generation

Binhui Liu; Xin Liu; Anbo Dai; Zhiyong Zeng; Dan Wang; Zhen Cui; Jian Yang

TL;DR

This work tackles flicker and instability in text-to-video generation by introducing the Dual-Stream Diffusion Net (DSDN), which models content and motion separately and aligns them with a cross-transformer. The content stream uses a frozen image diffusion backbone with a LoRA-style content increment unit, while the motion stream uses a 3D U-Net diffusion model to capture temporal dynamics; a dual-stream interaction module lets each stream condition on the other. A motion decomposition and combination module further simplifies motion processing and fuses motion into the final latent representation before decoding. Experiments on a large video dataset show that DSDN improves frame-level coherence and textual alignment compared with strong baselines, yielding better visual continuity and more diverse motion control.
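To make the "LoRA-style content increment unit" concrete, the following is a minimal sketch of a generic low-rank increment on a frozen linear map. It assumes the standard LoRA formulation (frozen weight `W` plus a scaled product `B @ A`); the class name `LoRALinear`, the rank, the scaling factor, and the dimensions are hypothetical and are not taken from the paper.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LoRALinear:
    """Frozen weight W plus a low-rank increment (alpha / rank) * B @ A.

    Illustrative only: names and hyperparameters are assumptions,
    not the paper's content increment unit.
    """

    def __init__(self, d_in, d_out, rank=2, alpha=2.0, seed=0):
        rng = random.Random(seed)
        # Frozen backbone weight (never updated during fine-tuning).
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable down-projection, small random init.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        # Trainable up-projection, zero init so the increment starts at zero.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank increment
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear(d_in=4, d_out=3)
x = [1.0, -2.0, 0.5, 3.0]
frozen_out = matvec(layer.W, x)
init_out = layer.forward(x)   # matches the frozen output while B is zero

layer.B[0][0] = 1.0           # pretend one increment weight was trained
tuned_out = layer.forward(x)  # only output row 0 can differ now
```

Because `B` starts at zero, the layer initially reproduces the frozen backbone, and only the small matrices `A` and `B` would need to be trained.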

Abstract

With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often contain flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generated videos. In particular, the two diffusion streams, a video content branch and a motion branch, can not only run separately in their private spaces to produce personalized video variations and content, but can also be well aligned between the content and motion domains through our cross-transformer interaction module, which benefits the smoothness of generated videos. In addition, we introduce a motion decomposer and a motion combiner to facilitate operations on video motion. Qualitative and quantitative experiments demonstrate that our method produces continuous videos with fewer flickers.
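As a rough illustration of how a cross-transformer interaction could align the two streams, the sketch below lets content tokens attend over motion tokens and vice versa. It is a simplified assumption, not the paper's module: it uses plain scaled dot-product attention with identity projections (no learned query, key, or value weights), and the function names `cross_attend` and `dual_stream_interaction` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over the context tokens.

    Identity projections are assumed for brevity; a real transformer
    block would apply learned query/key/value matrices.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])  # residual connection
    return out


def dual_stream_interaction(content, motion):
    """Mutual conditioning: both updates read the original inputs."""
    new_content = cross_attend(content, motion)
    new_motion = cross_attend(motion, content)
    return new_content, new_motion


content_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 content tokens, dim 2
motion_tokens = [[0.2, -0.1]]                           # 1 motion token, dim 2
new_content, new_motion = dual_stream_interaction(content_tokens, motion_tokens)
```

Token counts are preserved in each stream, so such an interaction step could be interleaved with the per-stream denoising layers without changing tensor shapes.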

Paper Structure (21 sections, 7 equations, 7 figures, 1 table)

Introduction
Related Work
Text-to-Image Generation
Text-to-Video Generation
Method
Overview
Forward Diffusion Process
Personalized Content Generation Stream
Personalized Motion Generation Stream
Dual-Stream Transformation Interaction
Motion Decomposition and Combination
Experiments
Implementation Details
Comparison with Baselines
Quantitative Comparison
...and 6 more sections

Figures (7)

Figure 1: Samples generated by our method.
Figure 2: DSDN network framework. First, noise is added to the content and motion features during the diffusion process, followed by denoising via the dual-stream diffusion net. Finally, the latent-space features of the generated video are obtained through the motion combiner and decoded to render the final video.
Figure 3: Dual-stream transformation block.
Figure 4: Details of Motion Decomposer and Motion Combiner.
Figure 5: Qualitative comparison between Text2Video-Zero [Levon2023Zero] (frames 1-4 in each row) and our method (frames 5-8 in each row). Please see the videos on the website.
...and 2 more figures
