DNI: Dilutional Noise Initialization for Diffusion Video Editing
Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo
TL;DR
DNI addresses the challenge of non-rigid editing in diffusion video editing, where the initial latent noise often preserves structural cues from the input video that hinder dynamic motion changes. The authors propose a Dilutional Noise Initialization framework that disentangles the initial noise into a visual component and Gaussian noise via an adaptive spectral filter, then dilutes the visual noise in targeted editing regions guided by cross-attention-derived masks, producing a dilutional latent noise $z^{\star}$. The method is model-agnostic and plug-and-play, improving textual alignment, fidelity, and frame consistency across both tuning-based and tuning-free editors on the DAVIS and TGVE benchmarks, with notable gains in non-rigid editing. The result is more flexible and precise diffusion video editing, particularly for edits that require structural change.
Abstract
Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing such as style transfer and object overlay, which preserves the original structure of the input video. This limitation stems from the initial latent noise employed in diffusion video editing systems. These systems prepare the initial latent noise for editing by gradually infusing Gaussian noise into the input video. However, we observe that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing, such as motion changes, that requires structural modifications. To this end, this paper proposes the Dilutional Noise Initialization (DNI) framework, which enables editing systems to perform precise and dynamic modifications, including non-rigid editing. DNI introduces the concept of "noise dilution", which adds further noise to the latent noise in the region to be edited to soften the structural rigidity imposed by the input video, resulting in more effective edits that are closer to the target prompt. Extensive experiments demonstrate the effectiveness of the DNI framework.
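To make the idea of noise dilution concrete, the following is a minimal NumPy sketch and not the authors' implementation. It rests on several assumptions that do not come from the paper: a fixed radial FFT low-pass filter stands in for the adaptive spectral filter, the editing mask is supplied directly rather than derived from cross-attention, and the names `split_latent`, `dilute_latent`, `cutoff`, and `strength`, along with the square-root blending rule, are hypothetical choices made for illustration.

```python
import numpy as np


def split_latent(z, cutoff=0.25):
    """Split a latent of shape (frames, channels, H, W) into a low-frequency
    'visual' part and a high-frequency residual.

    Assumption: a fixed radial low-pass filter replaces the adaptive
    spectral filter described in the paper.
    """
    h, w = z.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies in [-0.5, 0.5)
    # Keep frequencies whose radius is within `cutoff` of the Nyquist limit (0.5).
    lowpass = (np.sqrt(fy**2 + fx**2) <= cutoff * 0.5).astype(z.dtype)
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    visual = np.fft.ifft2(spectrum * lowpass, axes=(-2, -1)).real
    residual = z - visual
    return visual, residual


def dilute_latent(z, mask, strength=0.8, cutoff=0.25, rng=None):
    """Return a diluted latent: inside the mask, the visual component is
    partly replaced by the low-frequency part of fresh Gaussian noise.

    `mask` has values in [0, 1] and must broadcast against `z`
    (for example shape (frames, 1, H, W)). `strength` in [0, 1] is a
    hypothetical knob controlling how much structure is washed out.
    """
    rng = np.random.default_rng() if rng is None else rng
    visual, residual = split_latent(z, cutoff)
    fresh_visual, _ = split_latent(rng.standard_normal(z.shape), cutoff)
    weight = strength * mask
    # Square-root weights are an illustrative choice for mixing the two
    # low-frequency components; outside the mask (weight 0) z is unchanged.
    diluted_visual = np.sqrt(1.0 - weight) * visual + np.sqrt(weight) * fresh_visual
    return diluted_visual + residual
```

Under these assumptions, calling `dilute_latent(z, mask)` on an inverted latent `z` leaves the unmasked region identical to the input and weakens the input's low-frequency structure only where `mask` is nonzero, which is the behavior the abstract attributes to noise dilution.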
