DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang

TL;DR

DiffSynth tackles flicker in diffusion-based video synthesis by applying deflickering in the latent space during iteration and by a patch-based remapping/blending technique to ensure frame-to-frame consistency. The latent deflickering prevents flicker accumulation across denoising steps, while the patch blending aligns object appearances across frames with an efficient $O(n\log n)$ strategy. The approach adapts image synthesis pipelines to video across tasks like text-guided stylization, fashion video synthesis, image-guided stylization, restoration, and 3D rendering, outperforming prior zero-shot baselines in quantitative metrics and user studies. It leverages compatibility with Stable Diffusion variants and includes multiple practical modifications for real-world use, with acknowledged runtime considerations and avenues for future improvements.

Abstract

In recent years, diffusion models have emerged as the most powerful approach in image synthesis. However, applying these models directly to video synthesis is challenging, as it often produces noticeably flickering content. Although recently proposed zero-shot methods can alleviate flicker to some extent, generating coherent videos remains difficult. In this paper, we propose DiffSynth, a novel approach that converts image synthesis pipelines into video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering in the latent space of diffusion models, effectively preventing flicker from accumulating in intermediate steps. The video deflickering algorithm, named the patch blending algorithm, remaps objects in different frames and blends them together to enhance video consistency. A notable advantage of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering. In text-guided video stylization, DiffSynth makes it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source code will also be released.
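
The in-iteration idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-frame update below is a hypothetical stand-in for a real diffusion denoising step, and the temporal moving average is a deliberately simple substitute for the patch blending algorithm, which the abstract names but does not specify. All function names and parameters are assumptions made for illustration. The only point shown is structural: the deflickering function is called inside the iteration loop, on the latents of all frames, rather than once on the finished video.

```python
import random


def temporal_smooth(frames, radius=1):
    """Stand-in deflicker (NOT patch blending): average each frame's
    latent vector with its temporal neighbours inside a small window."""
    n = len(frames)
    out = []
    for i in range(n):
        window = frames[max(0, i - radius):min(n, i + radius + 1)]
        out.append([sum(vals) / len(window) for vals in zip(*window)])
    return out


def toy_denoise_step(latent, target, rate, rng):
    """Hypothetical per-frame update: move towards a target and add
    independent noise, which is what makes frames drift apart."""
    return [x + rate * (t - x) + rng.gauss(0.0, 0.05)
            for x, t in zip(latent, target)]


def synthesize(num_frames=8, dim=4, steps=20, deflicker=True, seed=0):
    """Run the toy loop; when `deflicker` is True, smooth the latents
    of all frames after every step (the in-iteration placement)."""
    rng = random.Random(seed)
    target = [1.0] * dim
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(num_frames)]
    for _ in range(steps):
        latents = [toy_denoise_step(z, target, 0.2, rng) for z in latents]
        if deflicker:
            latents = temporal_smooth(latents)
    return latents


def flicker(frames):
    """Mean absolute difference between consecutive frames."""
    diffs = [abs(a - b)
             for f, g in zip(frames, frames[1:])
             for a, b in zip(f, g)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    print("flicker without deflickering:", flicker(synthesize(deflicker=False)))
    print("flicker with in-iteration deflickering:", flicker(synthesize(deflicker=True)))
```

In this toy setting the per-step smoothing keeps frame-to-frame differences from building up across iterations. A real pipeline would replace `toy_denoise_step` with the scheduler step of the underlying image model and `temporal_smooth` with the video deflickering algorithm of choice.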

Paper Structure (19 sections, 12 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: An example of text-guided video stylization. The prompt in this example is "an orange and white cat".
  • Figure 2: An example of fashion video synthesis.
  • Figure 3: An example of ablation study. The prompt of this example is "cyberpunk, city, red neon light".
  • Figure 4: Examples of image-guided video stylization.
  • Figure 5: An example of video restoring.
  • ...and 1 more figure