Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts
Saemee Choi, Sohyun Jeong, Hyojin Jang, Jaegul Choo, Jinhee Kim
TL;DR
This work addresses the need for training-free video editing guided by both image and text prompts. It introduces VINO, a diffusion-based framework that constructs structured noise maps using $\rho$-start sampling and dilated dual masking, supplemented by zero image guidance, to achieve coherent, high-fidelity object edits while preserving background motion. The method operates without per-video training and demonstrates strong performance against state-of-the-art baselines across quantitative metrics and user studies. Its two-stage noise-map strategy and multimodal conditioning enable practical, efficient video editing adaptable to various text-to-video backbones.
Abstract
We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $\rho$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits, achieving strong performance compared to state-of-the-art baselines, all without any test-time or instance-specific training.
