Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models

Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz

TL;DR

This paper tackles the challenge of producing temporally consistent video from image diffusion models without training video-specific parameters. It introduces two zero-shot techniques, Noise Crystallization and Liquid Noise, which alter the input noise (and, for Liquid Noise, decoded latents via flow maps) to generate sequential frames while preserving detail. The work also probes the VAE embedding in latent diffusion models, revealing a disentangled latent space and VAE-driven upscaling characteristics that aid motion control and robustness. Across diverse applications—image-to-video, relighting, video-to-video with noise tracking, and seamless upscaling—the methods achieve competitive temporal coherence with substantially lower compute than full video models, suggesting practical pathways for resource-efficient video synthesis.

Abstract

Although diffusion models are powerful for image generation, producing consistent and controllable video with them is a longstanding problem. Video models require extensive training and computational resources, leading to high costs and large environmental impacts. Moreover, video models currently offer limited control of the output motion. This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail. These techniques can be applied to existing image models without training any video parameters (zero-shot) by altering the input noise in a latent diffusion model. Two complementary methods are presented. Noise crystallization ensures consistency but is limited to large movements due to reduced latent embedding sizes. Liquid noise trades consistency for greater flexibility without resolution limitations. The core concepts also allow other applications such as relighting, seamless upscaling, and improved video style transfer. Furthermore, the VAE embedding used for latent diffusion models is explored, yielding theoretical insights such as a method for human-interpretable latent spaces.

Paper Structure

This paper contains 21 sections, 2 equations, 22 figures, and 2 tables.

Figures (22)

  • Figure 1: Prompt-to-Video using noise crystallization (segmentation map on left). Full animations viewable in the supplementary materials [supplementary].
  • Figure 2: Prompt-to-Video using liquid noise. Difference image (right) shows cloud motion and subtle swaying of grass and trees.
  • Figure 3: Image-to-Video. A flow map used to move the arms and mouth of characters from Adventure Time [adventure_time]. Use permitted by Exceptions to copyright: Non-commercial research [gov_copyright].
  • Figure 4: Improved Video-to-Video style transfer. Notice the facial distortion in the examples without noise tracking (second column). Original video taken from [videoworldsimulators2024].
  • Figure 5: Result of translating input noise to the left for each sample (no segmentation map usage).
  • ...and 17 more figures
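
The idea of "altering the input noise" can be made concrete with a small sketch of the operation named in the Figure 5 caption: translating the input noise to the left from one frame to the next. This is an illustration under stated assumptions, not the authors' implementation. The function names `translate_noise` and `noise_sequence`, the latent shape `(4, 64, 64)`, and the choice to fill the vacated columns with fresh Gaussian samples are all hypothetical. Each resulting noise tensor would then be passed to an unmodified image diffusion sampler, which is not shown here.

```python
import numpy as np


def translate_noise(noise, shift, rng):
    """Shift a (C, H, W) noise array left by `shift` columns.

    Assumption: columns that scroll in from the right are filled with
    fresh standard-normal samples, so each frame's noise stays Gaussian.
    """
    if shift <= 0:
        return noise.copy()
    c, h, w = noise.shape
    shift = min(shift, w)
    out = np.empty_like(noise)
    # Existing noise moves left, carrying its pattern along with it.
    out[:, :, : w - shift] = noise[:, :, shift:]
    # Newly exposed region on the right receives new noise.
    out[:, :, w - shift:] = rng.standard_normal((c, h, shift))
    return out


def noise_sequence(num_frames, shape=(4, 64, 64), step=1, seed=0):
    """Build one noise tensor per frame, each shifted `step` columns further left."""
    rng = np.random.default_rng(seed)
    frames = [rng.standard_normal(shape)]
    for _ in range(num_frames - 1):
        frames.append(translate_noise(frames[-1], step, rng))
    return np.stack(frames)  # shape: (num_frames, C, H, W)


if __name__ == "__main__":
    seq = noise_sequence(num_frames=8, step=2)
    print(seq.shape)
```

Consecutive frames share most of their noise, which is the property a zero-shot approach of this kind relies on for temporal consistency. Because the shift happens in latent space, one latent column corresponds to several image pixels, which is in line with the abstract's remark that reduced latent embedding sizes limit noise crystallization to large movements.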