V$^2$Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes
Yanming Zhang, Jun-Kun Chen, Jipeng Lyu, Yu-Xiong Wang
TL;DR
V$^2$Edit tackles the dual challenge of fulfilling editing instructions while preserving original content in videos and 3D scenes, without requiring training. It does so through a progression-based editing framework that decomposes a complex edit into a sequence of milder subtasks and employs three synergistic preservation mechanisms (initial noise control, per-step noise control, and cross-attention map control), guided by dual generations and efficient attention handling. The method extends naturally to 3D scenes via a render-edit-reconstruct pipeline, which maintains high 3D consistency even under substantial geometric changes. Across video and 3D tasks, V$^2$Edit demonstrates state-of-the-art performance, strong content preservation, and practical efficiency, suggesting a scalable approach to unified video and 3D editing with diffusion models.
Abstract
This paper introduces V$^2$Edit, a novel training-free framework for instruction-guided video and 3D scene editing. To address the critical challenge of balancing preservation of the original content against fulfillment of the editing task, our approach employs a progressive strategy that decomposes a complex editing task into a sequence of simpler subtasks. Each subtask is governed by three synergistic controls: over the initial noise, over the noise added at each denoising step, and over the cross-attention maps between text prompts and video content. Together, these controls preserve the original video elements while still applying the desired edits. Beyond its native video editing capability, we extend V$^2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that V$^2$Edit produces high-quality, successful edits across a range of challenging video editing tasks and complex 3D scene editing tasks, establishing state-of-the-art performance in both domains.
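
The progressive strategy described in the abstract can be pictured with a toy sketch: one large edit is applied as a chain of milder subtasks, each operating on the output of the previous one. This is only an illustration under stated assumptions, not the paper's diffusion pipeline. The names `toy_edit` and `progressive_edit`, the per-subtask `strength` parameter, and the list-of-floats "video" are all hypothetical, and the noise and cross-attention controls are not modeled.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a video: a list of frames, each a list of pixel values in [0, 1].
Video = List[List[float]]


def toy_edit(video: Video, strength: float) -> Video:
    """Hypothetical mild edit: move every pixel a fraction `strength` toward 1.0."""
    return [[p + strength * (1.0 - p) for p in frame] for frame in video]


def progressive_edit(
    video: Video,
    edit_fn: Callable[[Video, float], Video],
    strengths: Sequence[float],
) -> Tuple[Video, List[Video]]:
    """Apply one mild subtask per strength, each on the previous subtask's output."""
    current = video
    history: List[Video] = []  # intermediate results, one per subtask
    for strength in strengths:
        current = edit_fn(current, strength)
        history.append(current)
    return current, history


# Two mild subtasks instead of a single aggressive edit (assumed toy values).
final, steps = progressive_edit([[0.0, 0.5]], toy_edit, [0.5, 0.5])
```

In this sketch `edit_fn` stands in for a single instruction-guided edit by a video diffusion model; keeping each call mild is what leaves room for the preservation controls to act at every stage.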
