FreeInit: Bridging Initialization Gap in Video Diffusion Models

Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu

TL;DR

The paper identifies a training–inference initialization gap in video diffusion models: the initial noise has different frequency characteristics at training and at inference, and at training its low-frequency band leaks information about the clean video. It introduces FreeInit, an inference-time procedure that iteratively refines the low-frequency content of the initial noise. Each iteration re-diffuses the generated latent to $z_T$, combines its low-frequency content with fresh high-frequency random noise via a 3D FFT-based filter, and samples again, so that the initial noise moves closer to the training distribution. Across multiple text-to-video (T2V) models and prompts, FreeInit yields consistent gains in temporal consistency and motion realism without additional training. The approach is lightweight, model-agnostic, and applicable to diffusion-based video (and potentially image) generation, at the cost of longer inference time and some parameter tuning.

Abstract

Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that contributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate for the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
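
The iterative refinement described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian filter shape, the normalized cutoff `d0`, the `(channels, frames, height, width)` latent layout, and the reuse of the original noise `eps` when re-diffusing are all choices made for this sketch. The `denoise` argument is a hypothetical stand-in for a full sampling pass of a video diffusion model.

```python
import numpy as np

def gaussian_low_pass(shape, d0=0.25):
    """Gaussian low-pass mask over (frames, height, width), laid out to match an
    fftshift-ed spectrum. `d0` is a hypothetical normalized cutoff."""
    # Frequencies along each axis, scaled to [-1, 1) and ordered as after fftshift.
    axes = [np.fft.fftshift(np.fft.fftfreq(n)) * 2.0 for n in shape]
    t, h, w = np.meshgrid(*axes, indexing="ij")
    return np.exp(-(t**2 + h**2 + w**2) / (2.0 * d0**2))

def mix_frequencies(z_T, eta, lpf):
    """Keep the low frequencies of z_T and take the high frequencies from eta."""
    dims = (-3, -2, -1)  # 3D FFT over frames, height, width
    z_freq = np.fft.fftshift(np.fft.fftn(z_T, axes=dims), axes=dims)
    eta_freq = np.fft.fftshift(np.fft.fftn(eta, axes=dims), axes=dims)
    mixed = z_freq * lpf + eta_freq * (1.0 - lpf)
    mixed = np.fft.ifftshift(mixed, axes=dims)
    return np.fft.ifftn(mixed, axes=dims).real

def freeinit_sample(denoise, eps, alpha_bar_T, lpf, num_iters=3, seed=0):
    """Run `num_iters` sampling passes, refining the initial noise in between.

    `denoise` is a placeholder for a full sampler mapping z_T to a clean latent.
    """
    rng = np.random.default_rng(seed)
    z_0 = denoise(eps)  # first pass starts from plain Gaussian noise
    for _ in range(num_iters - 1):
        # Forward-diffuse the clean latent back to the final timestep
        # (assumption of this sketch: the original noise eps is reused here).
        z_T = np.sqrt(alpha_bar_T) * z_0 + np.sqrt(1.0 - alpha_bar_T) * eps
        # Replace the high frequencies with fresh Gaussian noise.
        eta = rng.standard_normal(eps.shape)
        z_0 = denoise(mix_frequencies(z_T, eta, lpf))
    return z_0
```

Because the mask equals 1 at the zero-frequency bin, the per-channel mean of the re-diffused latent passes through the mixing step unchanged, while the finest detail comes almost entirely from the fresh noise. Each extra iteration costs one more full sampling pass, which is the inference-time trade-off the summary mentions.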

Paper Structure

This paper contains 22 sections, 13 equations, 24 figures, 5 tables.

Figures (24)

  • Figure 1: FreeInit for Video Generation. We propose FreeInit, a concise yet effective method to significantly improve temporal consistency of videos generated by diffusion models. FreeInit requires no additional training, introduces no learnable parameters, and can be easily incorporated into arbitrary video diffusion models at inference time.
  • Figure 2: Visualization of Decoded Noisy Latent from Different Spatio-Temporal Frequency Bands at Training. (a) Video frames decoded from the entire frequency band of the noisy latent $z_t$ in the DDPM forward process. (b) Frames decoded from the low-frequency components of $z_t$. It is evident that the diffusion process has difficulty in fully corrupting the semantics, leaving substantial spatio-temporal correlations in the low-frequency components. (c) Frames decoded from the high-frequency components of $z_t$. Each frame degenerates rapidly with the diffusion process.
  • Figure 3: Signal-to-Noise Ratio (SNR) of different frequency bands during the forward diffusion process. Each curve corresponds to a spatio-temporal frequency band of the latent code $z_t$ as noise is added during training. The pattern indicates a much slower corruption of low-frequency components.
  • Figure 4: Frequency Distribution of the SNR in the initial noise. When training with the typical Stable Diffusion Noise Schedule, the SNR of the initial noise is extremely high in low-frequency components, even larger than 0 dB (red circle). This indicates a severe information leak at the low-frequency band.
  • Figure 5: Role of Initial Low-Frequency Components. Each column shows three frames generated from the mixed initial noise. We observe that even if the majority (e.g., 80%) of high frequencies are replaced, the generated results still remain largely similar to the original "Full $z_T$" frames, indicating that the overall distribution of the generated results is determined by the low-frequency components of the initial noise.
  • ...and 19 more figures
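
The per-band SNR behaviour described in the captions of Figures 3 and 4 can be reproduced qualitatively on a toy latent. The sketch below is illustrative only: the function names, the band cutoff, and the synthetic "video" (a smooth static pattern plus weak texture) are assumptions of this example, and the schedule endpoints (0.00085 to 0.012 over 1000 steps, linear in $\sqrt{\beta}$) are the commonly used Stable Diffusion v1 defaults rather than values stated on this page.

```python
import numpy as np

def sd_alpha_bar(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative signal coefficient for a schedule that is linear in sqrt(beta)."""
    betas = np.linspace(beta_start**0.5, beta_end**0.5, num_steps) ** 2
    return np.cumprod(1.0 - betas)

def band_snr_db(z0, eps, alpha_bar_t, cutoff=0.25):
    """SNR in dB of the low and high bands of z_t = sqrt(ab)*z0 + sqrt(1-ab)*eps."""
    dims = (-3, -2, -1)  # frames, height, width
    axes = [np.fft.fftfreq(n) * 2.0 for n in z0.shape[-3:]]  # scaled to [-1, 1)
    t, h, w = np.meshgrid(*axes, indexing="ij")
    low = np.sqrt(t**2 + h**2 + w**2) <= cutoff  # hypothetical band split
    sig = np.abs(np.fft.fftn(np.sqrt(alpha_bar_t) * z0, axes=dims)) ** 2
    noi = np.abs(np.fft.fftn(np.sqrt(1.0 - alpha_bar_t) * eps, axes=dims)) ** 2

    def snr(mask):
        return 10.0 * np.log10(sig[..., mask].sum() / noi[..., mask].sum())

    return snr(low), snr(~low)

def toy_latent(shape=(4, 8, 16, 16), seed=0):
    """Synthetic latent: smooth pattern repeated over frames, plus weak texture."""
    rng = np.random.default_rng(seed)
    _, _, height, width = shape
    ys = np.sin(2 * np.pi * np.arange(height) / height)[:, None]
    xs = np.cos(2 * np.pi * np.arange(width) / width)[None, :]
    return (ys + xs) + 0.1 * rng.standard_normal(shape)

if __name__ == "__main__":
    alpha_bar = sd_alpha_bar()
    z0 = toy_latent()
    eps = np.random.default_rng(1).standard_normal(z0.shape)
    for step in (0, 499, 999):
        low_db, high_db = band_snr_db(z0, eps, alpha_bar[step])
        print(f"t={step + 1:4d}  low band: {low_db:6.1f} dB  high band: {high_db:6.1f} dB")
```

On inputs whose energy is concentrated at low frequencies, as in natural video, the low band keeps a much higher SNR than the high band at every timestep, including the last one. This is the leak that Figure 4 highlights; the exact numbers depend on the toy signal and should not be read as the paper's measurements.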