
UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control

Tian Xia, Xuweiyi Chen, Sihan Xu

TL;DR

UniCtrl is a novel, plug-and-play method that improves the spatiotemporal consistency and motion diversity of videos generated by text-to-video models, without additional training.

Abstract

Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.

Paper Structure (38 sections, 7 equations, 12 figures, 6 tables, 4 algorithms)

Figures (12)

  • Figure 1: UniCtrl for Video Generation. We propose UniCtrl, a concise yet effective method that significantly improves the temporal consistency of videos generated by diffusion models while preserving motion. UniCtrl requires no additional training, introduces no learnable parameters, and can be treated as a plug-and-play module at inference time.
  • Figure 2: The first row shows the original frames generated with the baseline model, in which the vehicle and the road are inconsistent across frames. The second row shows frames generated with the baseline model augmented with cross-frame Self-Attention Control (SAC). While SAC maintains strong spatiotemporal consistency, the result exhibits little motion. The third row shows frames augmented with SAC and Motion Injection (MI). Although MI adds more motion on top of SAC, the results again fall short on spatiotemporal consistency. The fourth row shows frames further augmented with Spatiotemporal Synchronization (SS) in addition to SAC and MI, which improves spatiotemporal consistency over the third row and achieves a balance between motion and spatiotemporal consistency, both in-frame and cross-frame.
  • Figure 3: In our framework, we use the key and value from the first frame, represented by $K^0$ and $V^0$, in the self-attention block. We also use another branch to keep the motion query $Q_\textrm{m}$ for motion control. At the beginning of every sampling step, we set the motion latent equal to the sampling latent to avoid spatiotemporal inconsistency. Note that in the actual workflow, $Q$ replacement occurs only in cross-attention, as the $Q$ in the self-attention blocks of both branches is always the same. We explain the details of our framework in Algorithm \ref{alg:selfattn} and Algorithm \ref{alg:mi}.
  • Figure 4: Qualitative Comparisons. We demonstrate UniCtrl's adaptability to diverse prompts, enhancing temporal consistency and preserving motion diversity. Comparative inference results with FreeInit are presented for context. Additionally, we demonstrate UniCtrl's seamless integration with FreeInit.
  • Figure 5: We provide additional qualitative examples across various backbones to demonstrate UniCtrl's capability in enhancing spatiotemporal consistency while effectively preserving motion dynamics.
  • ...and 7 more figures
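
The cross-frame self-attention control described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The tensor layout (frames, tokens, dim), the function names, and the random inputs are assumptions made for illustration only. The sketch covers only the replacement of each frame's keys and values with $K^0$ and $V^0$; the motion branch with $Q_\textrm{m}$ and the spatiotemporal synchronization step are not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Per-frame scaled dot-product attention.
    # Assumed layout (hypothetical): (frames, tokens, dim).
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

def cross_frame_self_attention(q, k, v):
    # Every frame keeps its own queries but attends to the keys and
    # values of the first frame (K^0, V^0 in the Figure 3 caption).
    k0 = np.broadcast_to(k[:1], k.shape)
    v0 = np.broadcast_to(v[:1], v.shape)
    return self_attention(q, k0, v0)

# Toy inputs: 4 frames, 6 tokens per frame, 8 channels (arbitrary sizes).
rng = np.random.default_rng(0)
frames, tokens, dim = 4, 6, 8
q = rng.normal(size=(frames, tokens, dim))
k = rng.normal(size=(frames, tokens, dim))
v = rng.normal(size=(frames, tokens, dim))

baseline = self_attention(q, k, v)                # each frame uses its own K, V
controlled = cross_frame_self_attention(q, k, v)  # all frames share K^0, V^0
```

In this sketch the first frame is unchanged, because it already attends to its own keys and values, while every later frame draws its output from the first frame's values. That sharing is one way to read the caption's claim that the control keeps content consistent across frames.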