
How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models

Pascal Chang, Jingwei Tang, Markus Gross, Vinicius C. Azevedo

TL;DR

This work addresses temporal incoherence in diffusion-based video editing by introducing an $\int$-noise representation, which treats each pixel as the integral of an underlying high-resolution white noise field, and a distribution-preserving noise transport that warps noise across frames while maintaining its Gaussian properties. The authors derive a noise transport equation and provide a discrete, practical implementation that yields temporally correlated yet distribution-preserving noise samples, enabling more coherent video editing, restoration, and generation. Across multiple diffusion-based tasks, the $\int$-noise prior improves temporal coherence and reduces artifacts compared with fixed or random noise, while maintaining competitive image quality; however, it incurs higher computational cost and its benefits depend on the diffusion pipeline and data. The approach offers a principled framework for moving noise through time in diffusion models and suggests further exploration in latent-diffusion settings and video-input training.
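To make the $\int$-noise representation concrete, here is a minimal sketch of splitting one pixel into sub-pixels and integrating them back. It is not the authors' code. It assumes that a pixel equals the sum of its $N = k^2$ sub-pixels scaled by $1/\sqrt{N}$, and it draws the sub-pixels from the Gaussian conditioned on that sum. The helper names `upsample_pixel` and `aggregate` are hypothetical.

```python
import math
import random


def upsample_pixel(x, k, rng):
    """Split one pixel value x into k*k sub-pixel values (hypothetical helper).

    Convention assumed here: a pixel equals the sum of its N = k*k sub-pixels
    scaled by 1/sqrt(N), i.e. divided by k. The sub-pixels are drawn from the
    Gaussian conditioned on that sum: a constant mean x/k plus zero-mean white
    noise whose sample mean has been removed.
    """
    n = k * k
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z_mean = sum(z) / n
    return [x / k + (zi - z_mean) for zi in z]


def aggregate(subpixels):
    """Integrate sub-pixels back into one pixel value: sum scaled by 1/sqrt(N)."""
    return sum(subpixels) / math.sqrt(len(subpixels))


if __name__ == "__main__":
    rng = random.Random(0)
    k = 4
    x = rng.gauss(0.0, 1.0)
    sub = upsample_pixel(x, k, rng)
    # Round trip: integrating the sub-pixels recovers the original pixel value.
    print(abs(aggregate(sub) - x) < 1e-9)  # True

    # Marginal check: when x ~ N(0, 1), each sub-pixel is again ~ N(0, 1).
    samples = [upsample_pixel(rng.gauss(0.0, 1.0), k, rng)[0] for _ in range(20000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"sub-pixel mean={mean:.3f} var={var:.3f}")
```

The variance check should come out near 1. Each sub-pixel is $x/k$ (variance $1/k^2$) plus an independent term of variance $1 - 1/k^2$, so the total is exactly 1. This is what lets the sub-pixels be treated as a higher-resolution white noise field.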

Abstract

Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations between subsequent frames of a video is detrimental to the quality of the results. This produces either high-frequency flickering or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is realized through a novel noise representation, dubbed $\int$-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values are not discrete samples but rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses $\int$-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See https://warpyournoise.github.io/ for video results.
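A short derivation shows why such a transport can be distribution-preserving. It uses a simplified view. Each pixel $\mathbf{p}$ of the transported frame is taken to be the scaled sum of the $n_\mathbf{p}$ sub-pixels in the set $\mathcal{A}_\mathbf{p}$ covered by its warped footprint, and sub-pixels are assumed to be i.i.d. standard normal. The symbols $G$, $\mathcal{A}_\mathbf{p}$ and $n_\mathbf{p}$ are illustrative notation and need not match the paper's own equations.

```latex
\begin{aligned}
G(\mathbf{p}) &= \frac{1}{\sqrt{n_\mathbf{p}}} \sum_{i \in \mathcal{A}_\mathbf{p}} W_i,
  \qquad W_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \\
\operatorname{Var}[G(\mathbf{p})] &= \frac{1}{n_\mathbf{p}} \sum_{i \in \mathcal{A}_\mathbf{p}} \operatorname{Var}[W_i] = \frac{n_\mathbf{p}}{n_\mathbf{p}} = 1, \\
\operatorname{Cov}[G(\mathbf{p}), G(\mathbf{q})] &= \frac{|\mathcal{A}_\mathbf{p} \cap \mathcal{A}_\mathbf{q}|}{\sqrt{n_\mathbf{p}\, n_\mathbf{q}}}.
\end{aligned}
```

Under these assumptions, every transported pixel stays standard normal. Two pixels are correlated exactly in proportion to the sub-pixels they share. Pixels with non-overlapping footprints therefore remain independent, while the overlap with pixels of the source frame carries the temporal correlation.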

Paper Structure

This paper contains 37 sections, 39 equations, 19 figures, 9 tables, and 2 algorithms.

Figures (19)

  • Figure 1: Our noise warping method lifts diffusion-based image editing methods like SDEdit Meng2022 and Person Image Diffusion Model (PIDM) Bhunia2022 to the temporal domain. It avoids the unnatural flickering and texture-sticking artifacts (see colored squares) that commonly appear with standard noise priors.
  • Figure 2: (a) The discrete noise transport equation pipeline. A subdivided pixel contour (top right) is triangulated and traced backwards from frame $T$ to frame $0$ (top left). The warped triangulated shape is then rasterized onto a higher-resolution approximation of the white noise (bottom). The sub-pixel values are summed and scaled according to the discrete white noise transport equation. (b) 1-D toy example. A pixel slides between two existing pixels whose values $x_0$, $x_1$ are sampled from a Gaussian distribution. Bilinear interpolation creates a sample of lower variance (straight line), whereas $\int$-noise creates samples that follow a Brownian bridge between $x_0$ and $x_1$, maintaining unit variance.
  • Figure 3: We visualize the correlation between two $4\times 4$ noise samples, with the warping being a horizontal shift by $\Delta x = 3.6$ pixels. Our $\int$-noise prior preserves the correlation between the two noise samples as well as bilinear interpolation does (left), while avoiding self-correlation between pixels in the warped noise (right).
  • Figure 4: Qualitative comparison of our noise warping method with baselines for video restoration tasks using I$^2$SB image models: consecutive frames comparison (top), $x$-$t$ slice (bottom).
  • Figure 5: Fluid $4\times$ super-resolution. As shown in the $x$-$t$ slices (bottom row), Random Noise creates incoherent details (noise in the slice), while Fixed Noise suffers from sticking artifacts (vertical lines in the slice). Our $\int$-noise moves the fluid more smoothly.
  • ...and 14 more figures
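The 1-D toy example of Figure 2(b) and the fractional shift of Figure 3 can be reproduced numerically with a small Monte Carlo sketch. It is not the authors' implementation, and it rests on several assumptions:

- It works on a 1-D row rather than a $4\times 4$ image.
- It uses `K = 10` sub-pixels per pixel, so that a 3.6-pixel shift lands on sub-pixel boundaries.
- It assumes periodic boundaries.
- It replaces the triangulate-and-rasterize step of Figure 2(a) with a plain sub-pixel offset.

All function names are hypothetical. Partial sums of the conditionally sampled sub-pixels act as a discrete stand-in for the Brownian bridge between neighbouring pixel values.

```python
import math
import random

K = 10          # sub-pixels per pixel (assumption: SHIFT * K must be an integer)
SHIFT = 3.6     # horizontal shift in pixels, as in Figure 3
SHIFT_SUB = round(SHIFT * K)  # the same shift measured in sub-pixels (36)


def upsample_row(row, k, rng):
    """Conditionally upsample a 1-D row: each pixel becomes k sub-pixels whose
    sum divided by sqrt(k) equals the original pixel value."""
    sub = []
    for x in row:
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        z_mean = sum(z) / k
        sub.extend(x / math.sqrt(k) + (zi - z_mean) for zi in z)
    return sub


def integral_shift(row, k, shift_sub, rng):
    """Shift right by shift_sub / k pixels by re-integrating the k sub-pixels
    under each shifted footprint (periodic boundaries assumed)."""
    sub = upsample_row(row, k, rng)
    out = []
    for j in range(len(row)):
        start = j * k - shift_sub
        window = [sub[(start + t) % len(sub)] for t in range(k)]
        out.append(sum(window) / math.sqrt(k))
    return out


def bilinear_shift(row, shift):
    """Baseline: shift right by linearly interpolating neighbouring pixels."""
    n = len(row)
    out = []
    for j in range(n):
        src = j - shift
        i0 = math.floor(src)
        a = src - i0
        out.append((1 - a) * row[i0 % n] + a * row[(i0 + 1) % n])
    return out


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def compare(trials=10000, n=16, seed=0):
    """Monte Carlo estimates of (variance, covariance with neighbour pixel 6,
    covariance with source pixel 1) for warped pixel 5, for both methods."""
    rng = random.Random(seed)
    src, ours5, ours6, bil5, bil6 = [], [], [], [], []
    for _ in range(trials):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ours = integral_shift(row, K, SHIFT_SUB, rng)
        bil = bilinear_shift(row, SHIFT)
        src.append(row[1])  # source pixel 1 lies under 60% of warped pixel 5
        ours5.append(ours[5])
        ours6.append(ours[6])
        bil5.append(bil[5])
        bil6.append(bil[6])
    return {
        "integral": (cov(ours5, ours5), cov(ours5, ours6), cov(ours5, src)),
        "bilinear": (cov(bil5, bil5), cov(bil5, bil6), cov(bil5, src)),
    }


if __name__ == "__main__":
    for name, (var, nb, cs) in compare().items():
        print(f"{name:8s} var={var:.2f} neighbour_cov={nb:.2f} cov_with_source={cs:.2f}")
```

The expected values follow analytically.

- **$\int$-noise:** variance 1, neighbour covariance 0, and covariance 0.6 with the source pixel.
- **Bilinear interpolation:** variance $0.6^2 + 0.4^2 = 0.52$, neighbour covariance $0.6 \cdot 0.4 = 0.24$, and the same covariance of 0.6 with the source pixel.

This mirrors the captions above. Both methods keep the link to the previous frame, but only the integral version keeps unit variance and avoids self-correlation between warped pixels.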