ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao

TL;DR

ConditionVideo is a training-free approach to text-to-video generation that builds on off-the-shelf image diffusion models and disentangles motion into background (scenery) and foreground (condition-guided) components. It uses a UNet branch for scenery motion and a 3D control branch for conditional guidance, with sparse bi-directional spatial-temporal attention (sBiST-Attn) to improve temporal coherence. The method reports better frame consistency, CLIP score, and pose accuracy than Tune-A-Video, ControlNet, and Text2Video-Zero, and its ablations support the importance of pose conditioning, temporal modeling, and the 3D control branch. Because no training is required, videos with controllable dynamics can be generated directly from a pre-trained image model, which is useful for content creation and video synthesis pipelines.

Abstract

Recent works have successfully extended large-scale text-to-image models to the video domain, producing promising results but at a high computational cost and requiring a large amount of video data. In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e.g., Stable Diffusion). ConditionVideo generates realistic dynamic videos from random noise or given scene videos. Our method explicitly disentangles the motion representation into condition-guided and scenery motion components. To this end, the ConditionVideo model is designed with a UNet branch and a control branch. To improve temporal coherence, we introduce sparse bi-directional spatial-temporal attention (sBiST-Attn). The 3D control network extends the conventional 2D ControlNet model, aiming to strengthen conditional generation accuracy by additionally leveraging the bi-directional frames in the temporal domain. Our method exhibits superior performance in terms of frame consistency, CLIP score, and conditional accuracy, outperforming the other compared methods.

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our training-free method generates videos conditioned on different inputs. (a) shows generation from a provided scene video and pose information; the background waves exhibit convincingly lifelike motion. (b), (c), and (d) are generated from the condition only: pose, depth, and segmentation, respectively.
  • Figure 2: Illustration of our proposed training-free pipeline. (Left) Our framework consists of a UNet branch and a 3D control branch. The UNet branch receives either the inverted reference video $z_T^{INV}$ or image-level noise $\epsilon_b$ for background generation. The 3D control branch receives an encoded condition for foreground generation. The text description is fed into both branches. (Right) Illustration of our basic spatial-temporal block. We insert our proposed sBiST-Attn module into the basic block, between the 3D convolution block and the cross-attention block. The details of the sBiST-Attn module are shown in Fig. 3.
  • Figure 3: Illustration of ConditionVideo's sBiST-Attn. The purple blocks mark the frames selected for concatenation, from which the key and value are computed. The pink block is the current frame, from which the query is computed. The blue blocks are the remaining frames in the video sequence. The latent features of the current frame $z_t^i$ and of the bi-directional frames $z_t^{3j+1}$, $j = 0, \dots, \lfloor (F-1)/3 \rfloor$, are projected to the query $Q$, key $K$, and value $V$. The attention-weighted sum is then computed from the query, key, and value. The parameters are the same as those of the self-attention module in the pre-trained image model.
  • Figure 4: Qualitative comparison conditioned on pose. Prompt: "The Cowboy, on a rugged mountain range, Western painting style". Our result is better in both temporal consistency and pose accuracy, while the other methods have difficulty maintaining one or both of these qualities.
  • Figure 5: Qualitative comparison conditioned on Canny edges. Prompt: "A man is running". Tune-A-Video has difficulty aligning with the Canny edges, while ControlNet struggles to maintain temporal consistency. Although Text2Video-Zero surpasses both of these approaches, it produces parts of the legs that do not match the actual human body structure, and the colors of the shoes it generates are inconsistent.
  • ...and 2 more figures
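
The frame selection described in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names `sparse_frame_indices` and `sbist_attention`, the single-head layout, the tensor shapes, and the plain softmax attention are all assumptions made for the example. It takes queries from every frame and builds keys and values only from the concatenated frames $3j+1$, $j = 0, \dots, \lfloor (F-1)/3 \rfloor$ (indices 0, 3, 6, ... when counting from zero), reusing one set of projection matrices as the caption says the pre-trained self-attention parameters are reused.

```python
import numpy as np


def sparse_frame_indices(num_frames: int, stride: int = 3) -> list[int]:
    """Zero-based indices of the frames 3j+1 (one-based), j = 0..floor((F-1)/3)."""
    return list(range(0, num_frames, stride))


def sbist_attention(z, w_q, w_k, w_v, stride: int = 3):
    """Hypothetical single-head sketch of sparse bi-directional attention.

    z:   (F, N, C) latent tokens, F frames with N tokens of width C each.
    w_*: (C, D) projection matrices (assumed shared with the image model).
    Returns an (F, N, D) array.
    """
    num_frames, _, channels = z.shape
    idx = sparse_frame_indices(num_frames, stride)

    # Concatenate the tokens of the selected frames into one context sequence.
    context = z[idx].reshape(-1, channels)            # (M, C), M = len(idx) * N

    q = z @ w_q                                       # (F, N, D): one query set per frame
    k = context @ w_k                                 # (M, D): shared by all frames
    v = context @ w_v                                 # (M, D)

    scores = q @ k.T / np.sqrt(k.shape[-1])           # (F, N, M)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (F, N, D)


# Usage: 8 frames, 4 tokens per frame, 16 channels, identity projections.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 4, 16))
eye = np.eye(16)
out = sbist_attention(latents, eye, eye, eye)
print(sparse_frame_indices(8))  # [0, 3, 6]
```

Because every frame attends to the same sparse set of past and future frames, the key and value sequence grows with roughly F/3 frames rather than all F, which is the cost saving that the word "sparse" in the name suggests.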