V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes

Yanming Zhang, Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang

TL;DR

V$^2$Edit tackles the dual challenge of following an editing instruction and preserving the original content in videos and 3D scenes, without requiring training data. It does so through a progressive editing framework that decomposes a complex edit into a sequence of milder subtasks and applies three synergistic preservation mechanisms (initial noise control, per-step noise control, and cross-attention map control), guided by dual generations and efficient attention handling. The method extends to 3D scenes via a render-edit-reconstruct pipeline, which keeps edits 3D-consistent even under substantial geometric changes. Across video and 3D tasks, V$^2$Edit reports state-of-the-art performance, strong content preservation, and practical efficiency, pointing toward unified video and 3D editing with diffusion models.

Abstract

This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. Addressing the critical challenge of balancing original content preservation with editing task fulfillment, our approach employs a progressive strategy that decomposes complex editing tasks into a sequence of simpler subtasks. Each subtask is controlled through three key synergistic mechanisms: the initial noise, noise added at each denoising step, and cross-attention maps between text prompts and video content. This ensures robust preservation of original video elements while effectively applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that our V$^2$Edit achieves high-quality and successful edits across various challenging video editing tasks and complex 3D scene editing tasks, thereby establishing state-of-the-art performance in both domains.
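
The progressive strategy described above can be illustrated with a minimal sketch. It assumes that subtasks are obtained by linearly interpolating between an embedding of the original prompt and an embedding of the edited prompt, and that each subtask starts from the previous subtask's result while still seeing the original video. The names `subtask_prompts`, `progressive_edit`, and `edit_fn` are hypothetical and are not taken from the paper; the actual diffusion-based editing step is left as a caller-supplied function.

```python
from typing import Any, Callable, List, Sequence

Vector = List[float]


def lerp(a: Sequence[float], b: Sequence[float], w: float) -> Vector:
    """Linear interpolation between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def subtask_prompts(src: Sequence[float], tgt: Sequence[float],
                    num_subtasks: int) -> List[Vector]:
    """One interpolated prompt embedding per subtask (assumed linear schedule).

    The last entry equals the target embedding, so the final subtask
    corresponds to the full edit.
    """
    if num_subtasks < 1:
        raise ValueError("num_subtasks must be at least 1")
    return [lerp(src, tgt, k / num_subtasks) for k in range(1, num_subtasks + 1)]


def progressive_edit(video: Any, src: Sequence[float], tgt: Sequence[float],
                     num_subtasks: int,
                     edit_fn: Callable[[Any, Any, Vector], Any]) -> Any:
    """Run the subtasks in order, feeding each result into the next one.

    edit_fn(previous_result, original_video, prompt) is a hypothetical
    stand-in for one preservation-controlled editing pass.
    """
    current = video
    for prompt in subtask_prompts(src, tgt, num_subtasks):
        current = edit_fn(current, video, prompt)
    return current
```

With four subtasks, the interpolation weights are 0.25, 0.5, 0.75, and 1.0, so each pass only has to make a mild change relative to the previous result.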

Paper Structure

This paper contains 34 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Our V$^2$Edit is a versatile approach that supports training-free instruction-guided editing for both videos and 3D scenes. Left: V$^2$Edit achieves high-quality editing satisfying both original content preservation and editing instruction fulfillment in video editing. Right: V$^2$Edit supports challenging 3D scene editing tasks involving significant geometric changes, which the baselines [in2n, proedit] fail to achieve.
  • Figure 2: Our V$^2$Edit framework features progressive editing. Given an editing instruction and the original video, a large vision-language model (LVLM) [gpt4o] generates prompts for both the original and edited videos. These prompts are interpolated to create a sequence of subtasks, which are executed progressively in our framework.
  • Figure 3: V$^2$Edit preservation control integrates three key synergistic methods to preserve the original content during editing: (i) control of the initial noise ('$\alpha T$'), (ii) management of noise added at each denoising step ('DDPM Latents'), and (iii) utilization of cross-attention maps between text prompts and video content. Each generation receives guidance on preservation from the previous subtask and the original video for a smooth progression.
  • Figure 4: Our V$^2$Edit achieves successful editing results in various video editing tasks with superior overall appearance, while preserving the original contents well. The baselines either generate results with strange appearance and artifacts, or fail to preserve the areas unrelated to the edit. Notably, CogVideoX-V2V [cogvideox], an official video-to-video editing model of CogVideoX, generates good-looking results but is unable to preserve the original contents, showing that the key to our V$^2$Edit lies in our novel progression framework and preservation control mechanism, rather than in the strong underlying CogVideoX model. More results are available at https://immortalco.github.io/V2Edit/.
  • Figure 5: Our V$^2$Edit achieves high-quality editing results in various challenging 3D scene editing tasks in the Face scene of the IN2N [in2n] dataset, with clear texture and geometry structure, bright color, and superior original content preservation. Notably, our V$^2$Edit successfully performs editing operations with significant geometric changes like object insertion. In contrast, the baselines either fail to perform the editing or do not preserve the contents of the original scene, e.g., the background color or the appearance of the person.
  • ...and 5 more figures
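
The '$\alpha T$' label in the Figure 3 caption suggests that denoising starts from a partially noised version of the source rather than from pure noise. The sketch below shows one plausible reading of that idea using the standard DDPM forward process, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, evaluated at step $t = \alpha T$. The linear beta schedule, the function names, and the flat list used as a "latent" are all assumptions for illustration, not details confirmed by the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def alpha_bar_schedule(num_steps: int, beta_start: float = 1e-4,
                       beta_end: float = 2e-2) -> List[float]:
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_to_fraction(latent: Sequence[float], alpha: float,
                      alpha_bars: Sequence[float],
                      rng: random.Random) -> Tuple[int, List[float]]:
    """Noise `latent` up to step round(alpha * T) of the forward process.

    Returns (start_step, noisy_latent). A smaller `alpha` adds less noise,
    so more of the original content survives into the denoising pass.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    total_steps = len(alpha_bars)
    t = max(1, round(alpha * total_steps))
    a = alpha_bars[t - 1]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
             for x in latent]
    return t, noisy
```

For example, `noise_to_fraction(latent, 0.5, alpha_bar_schedule(1000), random.Random(0))` starts denoising from step 500 of 1000 instead of from step 1000.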