Dual-Stream Diffusion Net for Text-to-Video Generation
Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Dan Wang, Zhen Cui, Jian Yang
TL;DR
This work tackles flicker and instability in text-to-video generation by introducing the Dual-Stream Diffusion Net (DSDN), which models content and motion separately and aligns them with a cross-transformer. The content stream uses a frozen image diffusion backbone with a LoRA-style content increment unit, while the motion stream uses a 3D U-Net diffusion model to capture temporal dynamics; a dual-stream interaction module lets each stream condition the other. A motion decomposition and combination module further simplifies motion processing and fuses motion into the final latent representation before decoding. Experiments on a large video dataset show that DSDN improves frame-level coherence and textual alignment compared with strong baselines, yielding stronger visual continuity and more diverse motion control for text-to-video generation.
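To make the phrase "LoRA-style content increment" concrete, here is a minimal sketch of the generic low-rank update idea: a frozen weight matrix is left untouched and a trainable low-rank product is added on top of it. This illustrates LoRA in general, not the paper's actual unit; the function names, the `scale` parameter, and the toy shapes are assumptions made for this example.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lora_forward(W_frozen, A, B, x, scale=1.0):
    """Generic LoRA-style forward pass: y = W x + scale * B (A x).

    W_frozen: d_out x d_in frozen backbone weight (never updated).
    A:        r x d_in trainable down-projection (r is the low rank).
    B:        d_out x r trainable up-projection.
    Hypothetical helper for illustration; not the paper's exact unit.
    """
    base = matvec(W_frozen, x)               # frozen backbone output
    increment = matvec(B, matvec(A, x))      # low-rank learned increment
    return [b + scale * inc for b, inc in zip(base, increment)]


# Toy example: a 2x2 identity backbone with a rank-1 increment.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank r = 1
B_zero = [[0.0], [0.0]]   # zero-initialised up-projection
B_trained = [[1.0], [0.0]]
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x)
y_trained = lora_forward(W, A, B_trained, x)
```

With `B` initialised to zero, the output equals that of the frozen backbone, which is why this kind of increment can be attached to a pretrained model without disturbing it at the start of training.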
Abstract
With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often contain flicker and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generated videos. Specifically, the two diffusion streams, a video content branch and a motion branch, not only run separately in their own private spaces to produce personalized video variations and content, but are also kept well aligned across the content and motion domains by our cross-transformer interaction module, which benefits the smoothness of the generated videos. In addition, we introduce a motion decomposer and combiner to facilitate operations on video motion. Qualitative and quantitative experiments demonstrate that our method produces continuous videos with fewer flickers.
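The alignment between the content and motion branches can be pictured as bidirectional cross-attention, where each stream's tokens query the other stream's tokens and receive a residual update. The sketch below is a deliberately simplified illustration under stated assumptions: it has no learned query/key/value projections, uses a single head, and the names `cross_attend` and `dual_stream_interact` are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention of `queries` over `context`.

    Assumption: keys and values are the raw context tokens (no learned
    projections), which keeps the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(d)])
    return out


def dual_stream_interact(content, motion):
    """Each stream is updated with a residual computed from the other stream."""
    from_motion = cross_attend(content, motion)    # content queries motion
    from_content = cross_attend(motion, content)   # motion queries content
    content_new = [[c + u for c, u in zip(t, upd)] for t, upd in zip(content, from_motion)]
    motion_new = [[m + u for m, u in zip(t, upd)] for t, upd in zip(motion, from_content)]
    return content_new, motion_new


# Toy tokens: two content tokens and one motion token, each of width 2.
content_tokens = [[1.0, 0.0], [0.0, 1.0]]
motion_tokens = [[2.0, 2.0]]
content_out, motion_out = dual_stream_interact(content_tokens, motion_tokens)
```

Both updates are computed from the original inputs, so neither stream sees the other's already-updated tokens within a single interaction step; a real module could equally apply them sequentially.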
