VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome
TL;DR
VidEdit introduces a zero-shot, spatially aware video editing framework that fuses Neural Layered Atlases with a pre-trained diffusion model. By guiding atlas edits with Mask2Former segmentations and HED edges, it achieves fine-grained, temporally coherent modifications without per-prompt optimization. Evaluated on DAVIS, it outperforms baselines in semantic fidelity and preservation while delivering substantial speed gains. Limitations stem from atlas quality for highly dynamic scenes, suggesting future work on more robust atlas construction.
Abstract
Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, existing diffusion-based video editing approaches cannot offer precise control over generated content while maintaining temporal consistency in long videos. On the other hand, atlas-based methods provide strong temporal consistency but make editing a video costly and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method, which ensures temporal smoothness by design. To grant precise user control over generated content, we utilize conditional information extracted from off-the-shelf panoptic segmenters and edge detectors, which guides the diffusion sampling process. This method ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and it can generate multiple compatible edits from a single text prompt. Project web page: https://videdit.github.io
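To illustrate why editing an atlas yields temporally consistent results with spatial control, the following toy sketch may help. It is not the VidEdit implementation: it assumes the atlas is a small image, that each frame comes with an integer lookup map into the atlas (real layered atlases use learned continuous mappings), and that the diffusion output is replaced by a constant stand-in. The names `edit_atlas`, `render_frames`, and `uv_maps` are hypothetical.

```python
import numpy as np

def edit_atlas(atlas, edited, mask):
    """Blend an edited atlas into the original, only inside `mask`."""
    m = mask[..., None].astype(atlas.dtype)
    return m * edited + (1.0 - m) * atlas

def render_frames(atlas, uv_maps):
    """Rebuild every frame by looking up its pixels in the shared atlas."""
    rows, cols = uv_maps[..., 0], uv_maps[..., 1]
    return atlas[rows, cols]

# Toy atlas and a stand-in for the diffusion model's edited atlas (assumption).
rng = np.random.default_rng(0)
atlas = rng.random((8, 8, 3))
edited = np.ones_like(atlas)

# Region to edit, standing in for a segmentation mask of the target object.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True

# Three 4x4 frames, each viewing a horizontally shifted window of the atlas.
r, c = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
uv_maps = np.stack([np.stack([r + 1, c + t], axis=-1) for t in range(3)])

new_atlas = edit_atlas(atlas, edited, mask)
frames = render_frames(new_atlas, uv_maps)  # shape (3, 4, 4, 3)
```

Because every frame samples the same edited atlas, an edited atlas pixel looks identical in all frames where it appears, and pixels outside the mask keep their original values.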
