DNI: Dilutional Noise Initialization for Diffusion Video Editing

Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo

TL;DR

DNI addresses the challenge of non-rigid editing in diffusion video editing, where the initial latent noise often preserves input structural cues that hinder dynamic motion changes. The authors propose a Dilutional Noise Initialization framework that disentangles the initial noise into a visual component and Gaussian noise via an adaptive spectral filter, then dilutes the visual noise in targeted editing regions guided by cross-attention-derived masks, producing a dilutional latent noise $z^{\star}$. The method is model-agnostic and plug-and-play, improving textual alignment, fidelity, and frame consistency across both tuning-based and tuning-free editors on DAVIS and TGVE benchmarks, with notable gains in non-rigid editing. These contributions enable more flexible, precise video editing with diffusion models, offering practical impact for controllable content modification in video synthesis workflows.

Abstract

Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing, such as style transfer and object overlay, which preserves the original structure of the input video. This limitation stems from the initial latent noise employed in diffusion video editing systems. These systems prepare the initial latent noise for editing by gradually infusing Gaussian noise into the input video. However, we observed that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing, such as motion change, that requires structural modifications. To this end, this paper proposes the Dilutional Noise Initialization (DNI) framework, which enables editing systems to perform precise and dynamic modifications, including non-rigid editing. DNI introduces the concept of 'noise dilution', which adds further noise to the latent noise in the region to be edited in order to soften the structural rigidity imposed by the input video, resulting in more effective edits that are closer to the target prompt. Extensive experiments demonstrate the effectiveness of the DNI framework.
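
The dilution idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a fixed radial low-pass filter in place of the adaptive spectral filter, a hand-supplied binary mask in place of the cross-attention-derived mask, and a hypothetical blending rule controlled by `alpha`. The names `lowpass_mask`, `dilute_noise`, `cutoff`, and `alpha` are illustrative only.

```python
import numpy as np

def lowpass_mask(shape, cutoff):
    """Radial low-pass mask over the last two (spatial) axes, unshifted FFT layout."""
    h, w = shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fy ** 2 + fx ** 2) <= cutoff).astype(np.float64)

def dilute_noise(z, edit_mask, cutoff=0.1, alpha=0.5, seed=0):
    """Toy noise dilution.

    z:         initial latent noise, shape (frames, channels, H, W)
    edit_mask: values in [0, 1], shape (frames, 1, H, W); 1 marks the editing region
    cutoff, alpha: assumed hyperparameters, not taken from the paper
    """
    rng = np.random.default_rng(seed)

    # Disentangle: low spatial frequencies stand in for the "visual" component z_v,
    # and the residual stands in for the Gaussian component z_g.
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    z_v = np.fft.ifft2(spectrum * lowpass_mask(z.shape, cutoff), axes=(-2, -1)).real
    z_g = z - z_v

    # Dilute: blend fresh Gaussian noise into z_v only where the mask is active.
    eps = rng.standard_normal(z.shape)
    w = alpha * edit_mask
    z_v_diluted = np.sqrt(1.0 - w) * z_v + np.sqrt(w) * eps

    # Recombine into the dilutional latent noise z_star.
    return z_v_diluted + z_g

# Usage: dilute the central region of a random latent.
z = np.random.default_rng(1).standard_normal((4, 4, 16, 16))
mask = np.zeros((4, 1, 16, 16))
mask[:, :, 4:12, 4:12] = 1.0
z_star = dilute_noise(z, mask)
```

Outside the mask the blend weight is zero, so `z_star` matches `z` there up to floating-point error; only the masked region receives additional noise.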

Paper Structure (28 sections, 5 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Edited videos of Dilutional Noise Initialization (DNI) framework. DNI performs text-based rigid and non-rigid edits, enabling effective alteration under high fidelity.
  • Figure 2: (a) Motion-change editing results of current systems (Geyer et al., 2023; Wu et al., 2022). (b) Categorical analysis of textual alignment with video across different types of editing on DAVIS (Pont-Tuset et al., 2017). (c) Overview of the current diffusion video editing process. (d) Visualization of the initial latent noise and the latent noise filtered by our adaptive spectral filter, where the input video's visual structure clearly remains in the initial latent noise.
  • Figure 3: (a) Illustration of the Dilutional Noise Initialization framework. Noise disentanglement separates the initial latent noise into a visual branch and a noise branch. The visual branch contains visual noise carrying input video components, and the noise branch contains Gaussian noise. Noise dilution adds further noise to the editing region of the visual noise, enabling dynamic modifications without being restricted by the input video layout. (b) Visualizations of the initial and dilutional latent noises.
  • Figure 4: Illustration of the Dilutional Noise Initialization (DNI) framework, which refines the initial latent noise $z$ into the dilutional latent noise $z^{\star}$, enabling editing systems to perform effective editing, including non-rigid editing. DNI contains two main modules: (1) Noise Disentanglement, which separates the noise $z$ into Gaussian noise $z_{g}$ and visual noise $z_{v}$ containing input video components, and (2) Noise Dilution, which adds Gaussian noise $\epsilon$ to $z_{v}$ to mitigate the restrictions of the input video structure near the editing region. The noises $z_{v}$ and $z_{g}$ are recombined to build $z^{\star}$, which serves as the input to video editing.
  • Figure 5: Discrete Fourier transform (DFT) of (a) initial latent noise $z$, (b) video latent feature $z_{0}$, and (c) white Gaussian noise $\epsilon$. A similar distribution between $z$ and $z_{0}$ (red circle) shows that $z$ contains the input video components (top: spatial domain 2D-DFT, bottom: temporal domain 1D-DFT).
  • ...and 6 more figures
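
The spectral comparison described in the Figure 5 caption can be reproduced in spirit with a few lines of NumPy. The sketch below is an illustration under stated assumptions, not the paper's analysis code: the latent $z_{0}$ is a synthetic static video, the initial noise $z$ is built with a DDPM-style mix $z = \sqrt{a}\,z_{0} + \sqrt{1-a}\,\epsilon$ using an assumed value of $a$, and the helper names `spatial_spectrum` and `temporal_spectrum` are hypothetical.

```python
import numpy as np

def spatial_spectrum(x):
    """Mean log-magnitude of the centered 2D-DFT over (H, W); x is (frames, channels, H, W)."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(x, axes=(-2, -1)), axes=(-2, -1)))
    return np.log1p(mag).mean(axis=(0, 1))

def temporal_spectrum(x):
    """Mean log-magnitude of the centered 1D-DFT along the frame axis."""
    mag = np.abs(np.fft.fftshift(np.fft.fft(x, axis=0), axes=0))
    return np.log1p(mag).mean(axis=(1, 2, 3))

rng = np.random.default_rng(0)
frames, channels, height, width = 8, 4, 16, 16

# Synthetic stand-in for a video latent: constant in space and time (assumption).
z0 = 2.0 * np.ones((frames, channels, height, width))
eps = rng.standard_normal(z0.shape)

# Toy forward noising with an assumed signal level a = 0.25.
a = 0.25
z = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

# Low-frequency (center) energy of z versus pure Gaussian noise.
spatial_z, spatial_eps = spatial_spectrum(z), spatial_spectrum(eps)
temporal_z, temporal_eps = temporal_spectrum(z), temporal_spectrum(eps)
center_gap_spatial = spatial_z[height // 2, width // 2] - spatial_eps[height // 2, width // 2]
center_gap_temporal = temporal_z[frames // 2] - temporal_eps[frames // 2]
```

In this toy setting both gaps are positive: the noised latent keeps more low-frequency energy than white Gaussian noise, which is the kind of residual structure the caption attributes to $z$.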