Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Xiang Fan, Anand Bhattad, Ranjay Krishna

TL;DR

Videoshop addresses the challenge of precise, localized semantic video editing without model retraining. It introduces inversion with noise extrapolation and latent normalization within a four-stage pipeline that propagates first-frame edits across a video while preserving motion and semantics. The approach, grounded in latent diffusion and the EDM framework, demonstrates superior edit fidelity, source faithfulness, and temporal consistency across benchmarks, with substantial efficiency gains. These results enable Photoshop-like image edits to be extended to videos, reducing manual per-frame work and broadening practical editing applications.

Abstract

We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, and more, with fine-grained control over location and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks, as measured by 10 evaluation metrics.

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Videoshop is a training-free method for precise video editing. Given an original video and user edits to the first frame, Videoshop automatically propagates the changes to all the frames of the video while maintaining semantic, geometric, and temporal consistency. To edit the first frame, users can leverage image editing tools, including text-based inpainting and professional editing software like Photoshop. As such, Videoshop supports video edits that correspond to the image edits Photoshop enables: users can add new objects, remove objects or their parts, modify attributes, etc.
  • Figure 2: Overview of Videoshop for localized semantic video editing. Our contributions are highlighted with red boxes and arrows. Our method includes four primary stages: (1) Encode & Norm, where the input video is encoded into a latent space using a VAE encoder and then normalized to ensure stability throughout inversion. (2) Inversion w/ Noise Extrapolation, where noise extrapolation is applied at each step to provide a corrective term that guides the inversion trajectory, so that the video is mapped to the correct latent noise. This step is key for aligning the latent space trajectory at every timestep. (3) Diffusion, which propagates the user's first-frame edits through time so that they are integrated consistently across the video sequence. (4) Rescale & Decode, where the edited latent sequence is rescaled to align with the original data's statistical distribution and decoded back into a video, yielding an output that reflects the desired semantic edits while maintaining the natural flow of the original sequence.
  • Figure 3: Cosine similarity matrix for pairs of latent vectors throughout the denoising process. The latent vectors are approximately collinear, which supports our linear noise extrapolation.
  • Figure 4: Examples of edited videos. Our method handles a diverse set of edit types; examples shown include appearance editing, object removal, semantic editing, and shape/texture editing. Videoshop successfully performs precise local edits while maintaining high visual fidelity to the source video.
  • Figure 5: Qualitative comparison against baselines. Videoshop successfully maintains visual fidelity to the source video and target edit, while existing methods fail to do so.
  • ...and 2 more figures
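
The collinearity check described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' code: `toy_trajectory` is a hypothetical stand-in that fabricates latents sharing one direction plus a small jitter, so the values it produces say nothing about the paper's measurements. In practice the inputs would be flattened latent vectors saved at successive denoising steps. The sketch only shows how a pairwise cosine similarity matrix is computed and why entries close to 1 indicate that the vectors are nearly collinear.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(latents):
    """Pairwise cosine similarity for a list of flattened latent vectors."""
    return [[cosine_similarity(u, v) for v in latents] for u in latents]


def toy_trajectory(dim=64, steps=5, jitter=0.01, seed=0):
    """Hypothetical synthetic latents (an assumption for illustration only):
    one shared direction scaled by the step index, plus small Gaussian jitter."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    return [
        [step * d + jitter * rng.gauss(0, 1) for d in direction]
        for step in range(1, steps + 1)
    ]


latents = toy_trajectory()
matrix = similarity_matrix(latents)

# The smallest entry summarizes how far the set is from perfectly collinear.
min_similarity = min(min(row) for row in matrix)
print(f"smallest pairwise cosine similarity: {min_similarity:.4f}")
```

If the smallest entry is close to 1, every pair of vectors points in nearly the same direction, which is the property the caption cites in support of a linear extrapolation.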