VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Paul Couairon, Clément Rambour, Jean-Emmanuel Haugeard, Nicolas Thome

TL;DR

VidEdit introduces a zero-shot, spatially aware video editing framework that fuses Neural Layered Atlases with a pre-trained diffusion model. By guiding atlas edits with Mask2Former segmentations and HED edges, it achieves fine-grained, temporally coherent modifications without per-prompt optimization. Evaluated on DAVIS, it outperforms baselines in semantic fidelity and preservation while delivering substantial speed gains. Limitations stem from atlas quality for highly dynamic scenes, suggesting future work on more robust atlas construction.

Abstract

Recently, diffusion-based generative models have achieved remarkable success in image generation and editing. However, existing diffusion-based video editing approaches cannot offer precise control over generated content while maintaining temporal consistency in long videos. Atlas-based methods, on the other hand, provide strong temporal consistency but make editing a video costly and lack spatial control. In this work, we introduce VidEdit, a novel method for zero-shot text-based video editing that guarantees robust temporal and spatial consistency. In particular, we combine an atlas-based video representation with a pre-trained text-to-image diffusion model to provide a training-free and efficient video editing method that is temporally smooth by design. To grant precise user control over generated content, we use conditional information extracted from off-the-shelf panoptic segmenters and edge detectors to guide the diffusion sampling process. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency. With this framework, processing a single video takes approximately one minute, and it can generate multiple compatible edits from a single text prompt. Project web page: https://videdit.github.io

Paper Structure (16 sections, 9 equations, 12 figures, 2 tables)


Figures (12)

  • Figure 1: VidEdit enables rich and diverse video edits on a precise semantic region of interest while perfectly preserving untargeted areas. The method is lightweight and maintains strong temporal consistency on long videos.
  • Figure 2: Our VidEdit pipeline: An input video (1) is fed into NLA models, which learn to decompose it into 2D atlases (2). Depending on the object we want to edit, we select an atlas representation to which we apply our editing diffusion pipeline (3). The edited atlas is then mapped back to frames via bilinear sampling from the associated pre-trained network $\mathbb{M}$ (4). Finally, the frame edit layers are composited over the original frames to obtain the desired edited video (5).
  • Figure 3: The three steps of our atlas editing procedure.
  • Figure 4: Masked LPIPS vs Local Object Accuracy. The size of each dot is proportional to the standard deviation of the local object accuracy.
  • Figure 5: Editing time. VidEdit can edit videos significantly faster than existing methods.
  • ...and 7 more figures
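
Steps (4) and (5) of the pipeline described in the Figure 2 caption, mapping an edited atlas back to a frame by bilinear sampling and then compositing the result over the original frame, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes grayscale images stored as nested Python lists, and it assumes the per-pixel atlas coordinates (`uv_map`) and opacities (`alpha`) have already been produced by the mapping network. The names `bilinear_sample`, `composite_frame`, `uv_map`, and `alpha` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of floats (assumption)


def bilinear_sample(atlas: Image, u: float, v: float) -> float:
    """Sample `atlas` at continuous coordinates (u = column, v = row)."""
    h, w = len(atlas), len(atlas[0])
    # Clamp so that lookups never leave the atlas.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    # Interpolate along columns first, then along rows.
    top = atlas[y0][x0] * (1 - fx) + atlas[y0][x1] * fx
    bottom = atlas[y1][x0] * (1 - fx) + atlas[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def composite_frame(
    frame: Image,
    uv_map: List[List[Tuple[float, float]]],  # hypothetical precomputed mapping
    alpha: Image,                             # hypothetical per-pixel opacity in [0, 1]
    edited_atlas: Image,
) -> Image:
    """Alpha-blend the edit sampled from the atlas over the original frame."""
    out: Image = []
    for y, row in enumerate(frame):
        out_row = []
        for x, original in enumerate(row):
            u, v = uv_map[y][x]
            edit = bilinear_sample(edited_atlas, u, v)
            a = alpha[y][x]
            out_row.append(a * edit + (1 - a) * original)
        out.append(out_row)
    return out


# Toy example: a 2x2 atlas and a one-row, two-pixel frame.
atlas = [[0.0, 1.0], [2.0, 3.0]]
frame = [[10.0, 10.0]]
uv_map = [[(0.5, 0.5), (1.0, 1.0)]]
alpha = [[1.0, 0.0]]  # first pixel fully edited, second pixel untouched
result = composite_frame(frame, uv_map, alpha, atlas)
print(result)  # [[1.5, 10.0]]
```

Because every frame reads from the same edited atlas, a single edit propagates consistently across the video, and pixels with zero opacity keep their original values exactly.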