VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo

TL;DR

VideoElevator narrows the quality gap between text-to-video (T2V) and text-to-image (T2I) diffusion models by decoupling each diffusion sampling step into temporal refinement and spatial enhancement, which lets the two kinds of model interact without any training. Each step applies a temporal low-pass frequency filter (LPFF) and T2V-based motion editing to obtain a motion-consistent latent, inverts that latent to a T2I-compatible noise latent, and then applies an inflated T2I with cross-frame attention to add detail. This yields higher frame quality and better alignment with prompts. The approach works with both foundational and personalized T2I, improving T2V baselines and enabling stylistically faithful, high-quality video synthesis without additional training. The results are validated through quantitative metrics, human studies, and ablations, which highlight the importance of LPFF, DDIM inversion, and cross-frame attention for temporal coherence and visual fidelity.
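
As an illustration of the temporal low-pass filtering idea mentioned above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the latent is laid out as (frames, channels, height, width) and uses an ideal box filter in the temporal frequency domain; the function name `temporal_low_pass` and the `keep_ratio` parameter are hypothetical, and the paper's actual filter shape and cutoff may differ.

```python
import numpy as np


def temporal_low_pass(latents: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Suppress high temporal frequencies (flicker) in a video latent.

    latents: array of shape (frames, channels, height, width).
    keep_ratio: hypothetical cutoff, as a fraction of the Nyquist frequency.
    """
    num_frames = latents.shape[0]
    # FFT along the frame axis only; spatial content is left untouched.
    spectrum = np.fft.fft(latents, axis=0)
    # Frequency of each FFT bin in cycles per frame, in the range [-0.5, 0.5).
    freqs = np.fft.fftfreq(num_frames)
    # Assumed ideal (box) filter: keep bins at or below keep_ratio * Nyquist (0.5).
    mask = (np.abs(freqs) <= keep_ratio * 0.5).astype(float)
    filtered = spectrum * mask[:, None, None, None]
    # The input is real and the mask is symmetric, so the result is real
    # up to numerical error.
    return np.fft.ifft(filtered, axis=0).real


if __name__ == "__main__":
    frames = 8
    static = np.ones((frames, 4, 2, 2))
    # Frame-to-frame sign flips are the highest temporal frequency (pure flicker).
    flicker = np.where(np.arange(frames) % 2 == 0, 1.0, -1.0)[:, None, None, None]
    smoothed = temporal_low_pass(static + 0.3 * flicker)
    print(np.abs(smoothed - static).max())
```

In this toy case the static content passes through unchanged while the frame-to-frame flicker is removed, which is the behavior a temporal low-pass filter is meant to have.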

Abstract

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Unlike conventional T2V sampling (i.e., joint temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inversion to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.
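
The abstract describes moving a latent between the noise levels expected by T2V and T2I. A minimal sketch of how such a hand-off can work is given below, assuming the standard deterministic DDIM formulation rather than the paper's exact equations, which are not reproduced here. A noisy latent $z_t$ is projected to a clean estimate $z_{t\rightarrow 0} = (z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)/\sqrt{\bar{\alpha}_t}$ and then re-noised to another level; moving to a noisier level with the same formula is DDIM inversion. The function names and the schedule values are hypothetical, and the noise prediction `eps` stands in for the output of a real denoising network.

```python
import numpy as np


def project_to_clean(z_t: np.ndarray, eps: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """Estimate the clean latent z_{t->0} from a noisy latent and a noise prediction."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)


def ddim_move(z_s: np.ndarray, eps: np.ndarray,
              alpha_bar_s: float, alpha_bar_target: float) -> np.ndarray:
    """Deterministic DDIM update from noise level alpha_bar_s to alpha_bar_target.

    A larger target alpha_bar denoises; a smaller one re-noises (DDIM inversion).
    """
    z0 = project_to_clean(z_s, eps, alpha_bar_s)
    return np.sqrt(alpha_bar_target) * z0 + np.sqrt(1.0 - alpha_bar_target) * eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((2, 4, 8, 8))   # toy clean latent
    eps = rng.standard_normal(z0.shape)      # stand-in for a network's noise prediction
    ab_t, ab_prev = 0.5, 0.8                 # hypothetical cumulative alphas at t and t-1
    z_t = np.sqrt(ab_t) * z0 + np.sqrt(1.0 - ab_t) * eps

    z_prev = ddim_move(z_t, eps, ab_t, ab_prev)     # denoise: t -> t-1
    z_back = ddim_move(z_prev, eps, ab_prev, ab_t)  # invert:  t-1 -> t
    print(np.allclose(z_back, z_t))
```

Because both directions pass through the same clean estimate, two models with different noise schedules can in principle exchange latents at that clean level. With a real network the noise prediction changes between steps, so inversion is only approximate.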

Paper Structure

This paper contains 14 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Videos enhanced by VideoElevator. VideoElevator aims at elevating the quality of videos generated by existing text-to-video models (e.g., ZeroScope) with text-to-image diffusion models (e.g., RealisticVision). It is training-free and plug-and-play, supporting cooperation between various text-to-video and text-to-image diffusion models.
  • Figure 2: VideoElevator for improved text-to-video generation. Top: Taking text $\tau$ as input, conventional T2V performs both temporal and spatial modeling and accumulates low-quality content throughout the sampling chain. Bottom: VideoElevator explicitly decomposes each step into temporal motion refining and spatial quality elevating, where the former encapsulates T2V to enhance temporal consistency and the latter harnesses T2I to provide more faithful details, e.g., dressed in suit. Empirically, applying T2V in only a few timesteps is sufficient to ensure temporal consistency.
  • Figure 3: Overview of VideoElevator. VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Temporal motion refining uses a Low-Pass Frequency Filter (LPFF) to reduce flicker and T2V-based SDEdit \cite{meng2021sdedit} to add fine-grained motion, and then inverts the latent to $\tilde{\bm{z}}_{t}$ with DDIM inversion \cite{song2021denoising}. Spatial quality elevating harnesses inflated T2I to directly transition $\tilde{\bm{z}}_{t}$ to $\bm{z}_{t-1}$, where the self-attention of T2I is inflated into cross-frame attention. To ensure interaction between T2V and T2I, noise latents are equally projected to clean latents with Eqn. \ref{eq:project}, e.g., $\bm{z}_{t}$ to $\bm{z}_{t\rightarrow 0}$.
  • Figure 4: Qualitative results enhanced with foundational T2I. VideoElevator enhances the performance of T2V baselines with StableDiffusion V1.5 or V2.1-base in terms of frame quality and text alignment. For frame quality, the videos enhanced by VideoElevator contain more details than the original videos. For text alignment, VideoElevator also produces videos that adhere better to the prompts; the parts of the prompts that the baselines fail to follow are colored in orange. Please watch the videos on the website for a better view.
  • Figure 5: Qualitative results enhanced with personalized T2I. With the power of personalized T2I, VideoElevator enables ZeroScope and LaVie to produce high-quality videos in various styles. Compared to personalized AnimateDiff, VideoElevator captures more faithful styles and photo-realistic details from personalized T2I, e.g., sunset time lapse at the beach. Please watch the videos on the website for a better view.
  • ...and 3 more figures
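
The Figure 3 caption says that the self-attention of T2I is inflated into cross-frame attention. The sketch below shows one common way to do this in NumPy and is an assumption, not the paper's confirmed design: each frame's queries attend to keys and values gathered from all frames, so every frame draws on the same visual context. Other variants exist (for example, attending only to an anchor frame). The function names and the (frames, tokens, dim) layout are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Attention in which each frame's queries see the tokens of every frame.

    q, k, v: arrays of shape (frames, tokens, dim).
    Assumed variant: keys and values are concatenated over all frames.
    """
    frames, tokens, dim = k.shape
    # Pool keys/values of all frames into one shared sequence.
    k_all = k.reshape(1, frames * tokens, dim)
    v_all = v.reshape(1, frames * tokens, dim)
    # (frames, tokens, frames * tokens) attention scores via broadcasting.
    scores = q @ k_all.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ v_all  # (frames, tokens, dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((3, 5, 4))
    k = rng.standard_normal((3, 5, 4))
    v = rng.standard_normal((3, 5, 4))
    print(cross_frame_attention(q, k, v).shape)
```

In plain per-frame self-attention, `k` and `v` would stay separate for each frame; pooling them is what ties the appearance of the frames together, at the cost of attention memory that grows with the number of frames.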