On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise

Chao Liu, Arash Vahdat

TL;DR

The paper analyzes warped noise for video diffusion models and proves that training with warped noise under the standard denoising objective induces equivariance to spatial warps, enabling motion-consistent video generation with fewer sampling steps. The proposed EquiVDM achieves superior motion alignment and temporal coherence without architectural changes and can be distilled into a one-step model via distribution-matching distillation. To address latent-space inconsistencies, a small independent-noise component is added, improving robustness. Across benchmarks, EquiVDM demonstrates stronger quality and motion controllability with reduced sampling costs, highlighting practical benefits for real-time video generation and video-to-video tasks.

Abstract

Temporally consistent video-to-video generation is critical for applications such as style transfer and upsampling. In this paper, we provide a theoretical analysis of warped noise - a recently proposed technique for training video diffusion models - and show that pairing it with the standard denoising objective implicitly trains models to be equivariant to spatial transformations of the input noise, which we term EquiVDM. This equivariance enables motion in the input noise to align naturally with motion in the generated video, yielding coherent, high-fidelity outputs without the need for specialized modules or auxiliary losses. A further advantage is sampling efficiency: EquiVDM achieves comparable or superior quality in far fewer sampling steps. When distilled into one-step student models, EquiVDM preserves equivariance and delivers stronger motion controllability and fidelity than distilled nonequivariant baselines. Across benchmarks, EquiVDM consistently outperforms prior methods in motion alignment, temporal consistency, and perceptual quality, while substantially lowering sampling cost.
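The equivariance property described in the abstract can be illustrated with a toy check: for a map $f$ and a spatial transformation $\mathcal{T}$, equivariance means $f(\mathcal{T} \circ \epsilon) = \mathcal{T} \circ f(\epsilon)$. The sketch below is only an illustration under stated assumptions. The 1-D "frame", the circular shift standing in for a warp, and the smoothing function `toy_denoiser` are all hypothetical and are not from the paper. The toy map is equivariant by construction, whereas the paper argues that a video diffusion model acquires the property through training with warped noise.

```python
import random


def shift(frame, dx):
    """Circularly shift a 1-D frame by dx pixels (toy stand-in for a warp T)."""
    n = len(frame)
    return [frame[(i - dx) % n] for i in range(n)]


def toy_denoiser(noise):
    """Circular 3-tap smoothing; a hypothetical stand-in for a denoising network."""
    n = len(noise)
    return [
        (noise[(i - 1) % n] + noise[i] + noise[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]  # input noise for one frame

# Apply the warp before the map, and after the map, then compare.
warp_then_denoise = toy_denoiser(shift(eps, 3))
denoise_then_warp = shift(toy_denoiser(eps), 3)

max_gap = max(abs(a - b) for a, b in zip(warp_then_denoise, denoise_then_warp))
equivariant = max_gap < 1e-12
```

With an equivariant map, moving the noise moves the output by the same transformation, which is the sense in which motion in the input noise can carry over to motion in the generated video.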

Paper Structure

This paper contains 19 sections, 1 theorem, 6 equations, 19 figures, 4 tables.

Key Result

Theorem 4.1

Consider a temporally consistent video with $K$ frames $\mathbf{V}=(V^{(0)}, V^{(1)}, \dots, V^{(K)})$, where each frame is obtained by a warping transformation of the first frame $V^{(0)}$, i.e., $V^{(k)} = \mathcal{T}_k \circ V^{(0)}$. Let the noisy video $\mathbf{V}_t$ be generated with consisten
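The set-up stated in the theorem, $V^{(k)} = \mathcal{T}_k \circ V^{(0)}$, can be sketched in a few lines. This is a minimal sketch of the stated set-up only, not of the theorem's conclusion, which is cut off in the text above. It assumes a 1-D frame, a circular shift by $k$ pixels as a hypothetical $\mathcal{T}_k$, and a noising rule of the form `alpha * frame + sigma * noise`; the names `warp`, `alpha`, and `sigma` are illustrative and not taken from the paper.

```python
import random


def warp(frame, k):
    """Toy warp T_k: circular shift by k pixels (hypothetical choice)."""
    n = len(frame)
    return [frame[(i - k) % n] for i in range(n)]


rng = random.Random(0)
n_pixels, n_frames = 12, 4
alpha, sigma = 0.6, 0.8  # assumed signal and noise scales at one diffusion time

first_frame = [rng.random() for _ in range(n_pixels)]
first_noise = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]

# Every frame is a warp of the first frame: V^(k) = T_k ∘ V^(0).
video = [warp(first_frame, k) for k in range(n_frames)]
# The noise is warped with the same T_k, so it moves with the content.
noise = [warp(first_noise, k) for k in range(n_frames)]

noisy_video = [
    [alpha * v + sigma * e for v, e in zip(frame, frame_noise)]
    for frame, frame_noise in zip(video, noise)
]

# Each noisy frame is then the same warp applied to the first noisy frame.
consistent = all(
    noisy_video[k] == warp(noisy_video[0], k) for k in range(n_frames)
)
```

In this toy case the noisy video inherits the frame-to-frame warps of the clean video, because the content and the noise are transformed identically.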

Figures (19)

  • Figure 1: EquiVDM: A video diffusion model that is equivariant to input spatial transformations generates videos with the same spatial transformation when provided with warped noise.
  • Figure 2: Values at three tracked points across frames of the pixel, latent, and noise videos. The variation in the latent video is much larger than in the pixel and noise videos because of compression in the latent space.
  • Figure 3: Frames from videos generated by different video-to-video generation models. VC2-EquiVDM uses warped noise without dense video conditioning; CtrlVid [chen2023control], T2V-Zero [khachatryan2023text2video], CtrlAdapter [lin2024ctrl], and VC2-EquiVDM-softedge use either Canny edge or softedge conditioning.
  • Figure 4: Straightness of generation trajectories for VACE [vace2025] with independent noise and VACE-EquiVDM with warped noise.
  • Figure 5: (a) Warped noise reduces the noise-to-video distance. (b) EquiVDM needs fewer sampling steps than a model using independent noise to achieve similar or better quality.
  • ...and 14 more figures

Theorems & Definitions (2)

  • Theorem 4.1
  • proof