Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models
Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz
TL;DR
This paper tackles the challenge of producing temporally consistent video from image diffusion models without training video-specific parameters. It introduces two zero-shot techniques, Noise Crystallization and Liquid Noise, which alter the input noise (and, for Liquid Noise, decoded latents via flow maps) to generate sequential frames while preserving detail. The work also probes the VAE embedding in latent diffusion models, revealing a disentangled latent space and VAE-driven upscaling characteristics that aid motion control and robustness. Across diverse applications—image-to-video, relighting, video-to-video with noise tracking, and seamless upscaling—the methods achieve competitive temporal coherence with substantially lower compute than full video models, suggesting practical pathways for resource-efficient video synthesis.
Abstract
Although diffusion models are powerful for image generation, consistent and controllable video remains a longstanding problem for them. Video models require extensive training and computational resources, leading to high costs and large environmental impacts. Moreover, video models currently offer limited control of the output motion. This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail. These techniques can be applied to existing image models without training any video parameters (zero-shot) by altering the input noise in a latent diffusion model. Two complementary methods are presented. Noise crystallization ensures consistency but is limited to large movements due to reduced latent embedding sizes. Liquid noise trades consistency for greater flexibility without resolution limitations. The core concepts also allow other applications such as relighting, seamless upscaling, and improved video style transfer. Furthermore, an exploration of the VAE embedding used for latent diffusion models is performed, resulting in interesting theoretical insights such as a method for human-interpretable latent spaces.
