IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

Shitong Shao; Zikai Zhou; Lichen Bai; Haoyi Xiong; Zeke Xie

TL;DR

A novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help video diffusion models (VDMs) surpass their current capabilities, achieving state-of-the-art performance on 4 benchmarks.

Abstract

The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI's Strawberry in enhancing performance by increasing the inference computational cost. Numerous prior studies have demonstrated that correctly scaling up computation in the sampling process can lead to improved generation quality, enhanced image editing, and compositional generalization. While inference-heavy algorithms for improved image generation have advanced rapidly, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research shows only minimal performance gains that are perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-Mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-Mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-Mixed Sampler achieves state-of-the-art performance on 4 benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-Mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 of the closed-source Pika-2.0.
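The division of labor between the two models can be illustrated with a toy sketch. This is not the authors' implementation: `eps_idm` and `eps_vdm` are hypothetical stand-in noise predictors (real IDM and VDM networks would take their place), each "frame" is reduced to a single number, `alpha_bar` is an assumed cumulative noise schedule, and the ordering of the noising ("go") and denoising ("back") steps, as well as the timestep spacing, are assumptions made only for illustration.

```python
import math

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha levels.

    a_to > a_from removes noise ("back"); a_to < a_from is the
    DDIM-inversion direction that adds noise ("go").
    """
    out = []
    for xi, ei in zip(x, eps):
        x0 = (xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
        out.append(math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei)
    return out

def eps_idm(x, a):
    # Hypothetical image model: treats every frame independently.
    return [math.sqrt(1.0 - a) * xi for xi in x]

def eps_vdm(x, a):
    # Hypothetical video model: couples each frame with the mean over frames.
    mean = sum(x) / len(x)
    return [math.sqrt(1.0 - a) * 0.5 * (xi + mean) for xi in x]

def go_back(x, t, alpha_bar, go, back):
    """Run the "go" steps, then the "back" steps.

    Assumes len(go) == len(back), so the latent ends at timestep t again.
    """
    cur = t
    for model in go:
        x = ddim_move(x, model(x, alpha_bar[cur]), alpha_bar[cur], alpha_bar[cur + 1])
        cur += 1
    for model in back:
        x = ddim_move(x, model(x, alpha_bar[cur]), alpha_bar[cur], alpha_bar[cur - 1])
        cur -= 1
    return x

def sample(x, alpha_bar, mixed_steps,
           go=(eps_idm, eps_vdm), back=(eps_idm, eps_vdm)):
    # alpha_bar[0] is nearly clean; alpha_bar[-1] is nearly pure noise.
    last = len(alpha_bar) - 1
    for t in range(last, 0, -1):
        if t in mixed_steps and t + len(go) <= last:
            x = go_back(x, t, alpha_bar, go, back)  # extra inference compute
        # Ordinary DDIM denoising step with the video model.
        x = ddim_move(x, eps_vdm(x, alpha_bar[t]), alpha_bar[t], alpha_bar[t - 1])
    return x
```

In this sketch, the default `go` and `back` tuples mimic an "IV-IV"-style pattern, and `mixed_steps` selects the timesteps at which the extra go-back computation is spent; passing an empty set reduces `sample` to plain DDIM sampling with the stand-in video model.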

Paper Structure (42 sections, 1 theorem, 13 equations, 12 figures, 6 tables, 1 algorithm)

Introduction
Motivation.
Contribution.
Preliminary
Diffusion Models.
Video Diffusion Model vs. Image Diffusion Model.
SDEdit.
DDIM & DDIM-Inversion.
Approach
Forward (Go!!) and Reverse (Back!!)
IV-mixed Sampler
Discussion
Hyperparameter Design Space.
Theoretical Analysis.
The Influence of Latent Space.
...and 27 more sections

Key Result

Theorem 3.1

(Proof in the appendix.) IV-mixed Sampler can be converted into an ODE. For example, "IV-IV" has a corresponding ODE in which $\omega$ refers to the vanilla CFG scale, while $\omega^\textrm{IDM}_\textrm{go-back}$ and $\omega^\textrm{VDM}_\textrm{go-back}$ are both CFG scales greater than 0. Let $\nabla_\mathbf{x}\log q^\textrm{IDM}_{t}(\mathbf{c}|\mathbf{x})$ and $\nabla_\ldots$

Figures (12)

Figure 1: Visualization of IV-mixed Sampler and the standard DDIM sampling on Animatediff and VideoCrafterV2. Unlike prior inference-heavy approaches [guo2024i4vgen, freeinit], IV-mixed Sampler is able to significantly improve the fidelity of the video while guaranteeing semantic faithfulness.
Figure 2: UMTScore (↑) vs. UMT-FVD (↓) with Animatediff [animatediff] on Chronomagic-Bench-150 [yuan2024chronomagic]. In the legend, "R", "I", and "V" denote score function estimation using random Gaussian noise, IDM, and VDM, respectively. The part before the hyphen "-" gives the noise-adding steps, while the part after it gives the denoising steps. For instance, "RR-II" stands for two steps of adding Gaussian noise followed by two steps of denoising with the IDM.
Figure 3: Overview of our IV-mixed Sampler, Video-based Sampler, and Image-based Sampler. IV-mixed Sampler utilizes IDM and VDM to ensure synthesized video quality and temporal coherence, respectively.
Figure 4: Ablation studies on sampling intervals of IV-mixed Sampler ("IV-IV") with Animatediff (SD V1.5, Motion Adapter V3). The Begin%-End% in the legend indicates the portion of the entire sampling process performed by IV-mixed Sampler. For example, in a 50-step sampling scenario, 0%-50% corresponds to IV-mixed Sampler being applied during steps 1-25. More details of "IV-VI" can be found in the appendix on additional ablation studies.
Figure 5: Ablation studies on the $C_4^2$ possible combinations with Animatediff (SD V1.5, Motion Adapter V3). IV-mixed Sampler ("IV-IV") clearly performs best across all metrics.
...and 7 more figures
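The legend conventions used in the captions of Figures 2, 4, and 5 can be made concrete with a few small helpers. The function names `parse_pattern` and `interval_to_steps` are hypothetical and exist only for this illustration; the rounding rule for intervals that do not divide the step count evenly is likewise an assumption, since the captions only give the 50-step example.

```python
import math

# Letter codes as described in the Figure 2 caption.
SOURCES = {"R": "random Gaussian noise", "I": "IDM", "V": "VDM"}

def parse_pattern(label):
    """Split a label such as "RR-II" into (noising steps, denoising steps)."""
    go, back = label.split("-")
    return [SOURCES[c] for c in go], [SOURCES[c] for c in back]

def interval_to_steps(begin_pct, end_pct, num_steps):
    """Map a Begin%-End% interval to an inclusive, 1-indexed step range."""
    first = begin_pct * num_steps // 100 + 1
    last = end_pct * num_steps // 100
    return first, last

print(parse_pattern("RR-II"))
print(interval_to_steps(0, 50, 50))  # the caption's example: steps 1-25
print(math.comb(4, 2))               # C_4^2 counts 6 combinations
```

Under these assumptions, "IV-IV" parses to noising steps `["IDM", "VDM"]` followed by denoising steps `["IDM", "VDM"]`, and a 50%-100% interval over 50 steps maps to steps 26-50.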

Theorems & Definitions (1)

Theorem 3.1
