ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
TL;DR
ConditionVideo is a training-free approach to text-to-video generation that builds on off-the-shelf image diffusion models and disentangles motion into condition-guided and scenery components. It uses a UNet branch for scenery motion and a 3D control branch for conditional guidance, with sparse bi-directional spatial-temporal attention (sBiST-Attn) to improve temporal coherence. The method reports better frame consistency, CLIP score, and pose accuracy than Tune-A-Video, ControlNet, and Text2Video-Zero, and its ablations support the importance of pose conditioning, temporal modeling, and the 3D control branch. Because no training is required, the approach offers an efficient route to high-quality videos with controllable dynamics for content creation and video synthesis pipelines.
Abstract
Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and with a need for large amounts of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation guided by a provided condition, video, and input text, which leverages off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or from given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, CLIP score, and conditional accuracy, outperforming the compared methods.
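To make the idea of sparse bi-directional spatial-temporal attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each frame's queries attend to keys and values gathered from a sparse set of reference frames drawn from the whole clip (both earlier and later frames), plus the frame itself. The function name `sparse_bidirectional_attention`, the `stride` parameter, and the choice of every third frame are illustrative assumptions; the abstract does not specify the sampling scheme.

```python
import numpy as np

def sparse_bidirectional_attention(q, k, v, stride=3):
    """Illustrative sketch only (hypothetical helper, not the paper's code).

    q, k, v: float arrays of shape (frames, tokens, dim).
    Each frame attends to tokens from every `stride`-th frame of the
    whole clip (past and future alike) and from itself.
    """
    num_frames, _, dim = q.shape
    # Assumed sparse reference set: evenly spaced frames over the clip.
    ref = np.arange(0, num_frames, stride)
    out = np.empty_like(q)
    for i in range(num_frames):
        # Always include the current frame alongside the reference frames.
        idx = np.union1d(ref, [i])
        keys = k[idx].reshape(-1, dim)      # (len(idx) * tokens, dim)
        values = v[idx].reshape(-1, dim)
        # Scaled dot-product attention with a numerically stable softmax.
        scores = q[i] @ keys.T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i] = weights @ values
    return out

# Usage: 8 frames, 4 spatial tokens per frame, 16-dimensional features.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 16))
y = sparse_bidirectional_attention(x, x, x, stride=3)
```

Because the reference set spans the whole clip, an early frame can draw on later frames and vice versa, which is the bi-directional property the abstract names. Sparsity keeps the key/value set much smaller than attending to every frame.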
