Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts

Saemee Choi, Sohyun Jeong, Hyojin Jang, Jaegul Choo, Jinhee Kim

TL;DR

This work addresses the need for training-free video editing guided by both image and text prompts. It introduces VINO, a diffusion-based framework that constructs structured noise maps using $\rho$-start sampling and dilated dual masking, supplemented by zero image guidance, to achieve coherent, high-fidelity object edits while preserving background motion. The method operates without per-video training and demonstrates strong performance against state-of-the-art baselines across quantitative metrics and user studies. Its two-stage noise-map strategy and multimodal conditioning enable practical, efficient video editing adaptable to various text-to-video backbones.

Abstract

We propose VINO, the first zero-shot, training-free video editing method conditioned on both image and text. Our approach introduces $\rho$-start sampling and dilated dual masking to construct structured noise maps that enable coherent and accurate edits. To further enhance visual fidelity, we present zero image guidance, a controllable negative prompt strategy. Extensive experiments demonstrate that VINO faithfully incorporates the reference image into video edits, achieving strong performance compared to state-of-the-art baselines, all without any test-time or instance-specific training.

Paper Structure

This paper contains 28 sections, 10 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Qualitative results of VINO. Our training-free method, VINO, successfully achieves video edits using text and image prompts.
  • Figure 2: Importance of appropriate integration of image prompts in video editing. Given (a) a source video, we compare (b) a state-of-the-art text-driven video editing method, (c) a naive combination of a pretrained T2V model with an image-guided module, and (d) our proposed method. The text-only approach in (b) struggles to accurately transfer the structural appearance of the car in the reference image. Simply attaching the image-guided module to a pretrained T2V model, as in (c), produces unnatural results. In contrast, our proposed training-free method in (d) precisely transfers the car in the reference image into the video with high fidelity.
  • Figure 3: Runtime comparison across video resolutions. We compare VINO with Make-A-Pro, an image-guided video editing method. Results are averaged over three runs per setting.
  • Figure 4: Overview of VINO. Our proposed training-free approach, VINO, adopts a coarse-to-fine scheme to guide the two-stage editing process via strategically designed noise maps. Both stages leverage $\rho$-start sampling with text and image prompts. In the first stage, a rough noise map is constructed from the source video via dilated masking (a-1), which is then denoised by the pretrained T2V model to yield a rough edit (b-1). In the second stage, a refined good noise map is derived from the rough edit via dilated dual masking (a-2). Finally, the model produces the final output based on this good noise map (b-2), faithfully reflecting the prompts while preserving motion and structure.
  • Figure 5: Controlled noise sampling for structured edits in VINO. Foreground and background latents, obtained via $\rho$-start sampling and zero image guidance, are blended using a final mask formed by the union of dilated source and target masks. This enables clean removal of the source object and seamless insertion of the target.
  • ...and 14 more figures
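
The mask-based blending described in the Figure 5 caption can be illustrated with a minimal NumPy sketch for a single frame. This is an illustration under stated assumptions, not the authors' implementation. The function names (`dilate`, `blend_latents`), the square dilation window, the `radius` parameter, and the `(C, H, W)` latent layout are all hypothetical. The listing above does not specify the dilation kernel or how the foreground and background latents are produced, so they are taken here as given inputs.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation of an (H, W) mask with a (2*radius+1)^2 square window.

    The square window is an assumption made for this sketch.
    """
    mask = mask.astype(bool)
    if radius == 0:
        return mask
    h, w = mask.shape
    # Pad with False so the window can slide over the border pixels.
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # A pixel is set if any pixel inside its window is set.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def blend_latents(fg, bg, src_mask, tgt_mask, radius=1):
    """Blend foreground and background latents of shape (C, H, W).

    The final mask is the union of the dilated source and target masks.
    Inside the mask the foreground latent is used, outside it the
    background latent (assumed direction of the blend).
    """
    final_mask = dilate(src_mask, radius) | dilate(tgt_mask, radius)
    m = final_mask[None, :, :].astype(fg.dtype)  # broadcast over channels
    return m * fg + (1.0 - m) * bg, final_mask


# Toy example: 4-channel latents on a 5x5 grid.
src = np.zeros((5, 5), dtype=bool)
src[1, 1] = True  # hypothetical source-object location
tgt = np.zeros((5, 5), dtype=bool)
tgt[3, 3] = True  # hypothetical target-object location
fg = np.ones((4, 5, 5))
bg = np.zeros((4, 5, 5))
blended, final_mask = blend_latents(fg, bg, src, tgt, radius=1)
```

Taking the union of both dilated masks means the blended region covers where the source object was as well as where the target will appear, which matches the caption's stated goal of removing the source object while inserting the target.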