VideoDirector: Precise Video Editing via Text-to-Video Models

Yukun Wang, Longguang Wang, Zhiyuan Ma, Qibin Hu, Kai Xu, Yulan Guo

TL;DR

VideoDirector tackles the problem of precise video editing with text-to-video models by addressing two core issues: tight spatial-temporal coupling and complex layout, which cause artifacts in traditional inversion-based editing. It introduces spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization to provide temporal cues for pivotal inversion, plus a self-attention control strategy to maintain a faithful spatial-temporal layout. The method aligns the diffusion backward trajectory with DDIM inversion and uses mutual attention with frame-aware masks to preserve unedited content while applying edits, achieving higher accuracy, motion smoothness, realism, and fidelity than state-of-the-art approaches. Experiments on 75 editing pairs demonstrate substantial improvements across objective metrics and a user study, indicating practical viability for high-fidelity, temporally coherent video editing directly via T2V models.

Abstract

Although the typical inversion-then-editing paradigm using text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack the ability to generate temporally coherent content, often resulting in inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to: 1) Tight spatial-temporal coupling. The vanilla pivotal-based inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; 2) Complicated spatial-temporal layout. The vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose spatial-temporal decoupled guidance (STDG) and a multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity for precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capabilities of T2V models, producing edited videos with state-of-the-art performance in accuracy, motion smoothness, realism, and fidelity to unedited content.
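
For orientation, the DDIM inversion that pivotal inversion builds on can be sketched in a few lines. This is a toy illustration under stated assumptions, not the VideoDirector implementation: the names `ddim_step`, `ddim_invert`, `ddim_sample`, and `reconstruction_error` are hypothetical, the noise predictor `eps_model` is a stand-in for a real T2V denoiser, and the latent shape and noise schedule are chosen arbitrarily.

```python
import numpy as np


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministically move a latent from noise level alpha_from to alpha_to."""
    # Predict the clean latent implied by the current noise estimate.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that prediction to the target level with the same noise estimate.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def ddim_invert(x0, eps_model, alphas):
    """Walk from the (nearly) clean level alphas[0] to the noisiest level alphas[-1].

    Returns every intermediate latent; pivotal inversion uses such a
    trajectory as the target that the backward pass should follow.
    """
    x = x0
    trajectory = [x0]
    for t in range(len(alphas) - 1):
        # Approximation: the noise estimated at level t is reused for level t + 1.
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alphas[t], alphas[t + 1])
        trajectory.append(x)
    return trajectory


def ddim_sample(x_T, eps_model, alphas):
    """Walk back from the noisiest level to the clean level."""
    x = x_T
    for t in range(len(alphas) - 1, 0, -1):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alphas[t], alphas[t - 1])
    return x


def reconstruction_error(eps_model, seed=0, steps=50):
    """Invert a toy video latent, sample it back, and report the max abs error."""
    rng = np.random.default_rng(seed)
    # Toy latent with assumed layout (frames, channels, height, width).
    video_latent = rng.standard_normal((4, 4, 8, 8))
    # Assumed schedule: cumulative alpha decreasing from almost clean to noisy.
    alphas = np.linspace(0.999, 0.1, steps)
    x_T = ddim_invert(video_latent, eps_model, alphas)[-1]
    recon = ddim_sample(x_T, eps_model, alphas)
    return float(np.max(np.abs(recon - video_latent)))


if __name__ == "__main__":
    # A predictor that ignores its input makes the round trip exact.
    print(reconstruction_error(lambda x, t: np.full_like(x, 0.3)))
    # A predictor that depends on the latent makes it only approximate.
    print(reconstruction_error(lambda x, t: 0.1 * np.tanh(x)))
```

When the noise estimate depends on the latent, the backward pass generally drifts away from the inversion trajectory. Null-text optimization and the guidance described in the abstract are aimed at closing that gap for video latents.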

Paper Structure

This paper contains 13 sections, 12 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Edited results. Our method enables precise content editing of an input video based on a text prompt, while preserving unedited content. By directly leveraging the text-to-video (T2V) generation model AnimateDiff (Guo et al., 2023) for editing, the edited results exhibit high fidelity, real-world motion smoothness, and enhanced realism.
  • Figure 2: Principle visualization of our approach. Comparison of diffusion pivotal inversion (Mokady et al., 2022) using a T2V generation model, AnimateDiff (Guo et al., 2023), integrated with vanilla null-text optimization (a) and our proposed guidance (b). Our approach constrains the reverse diffusion trajectory during video generation to align with DDIM inversion, enabling precise reconstruction of the input video.
  • Figure 3: Video pivotal inversion pipeline. Our pipeline comprises two key components: multi-frame null-text optimization and spatial-temporal decoupled guidance, which are integrated into the standard pivotal inversion pipeline.
  • Figure 4: Our video editing pipeline. SA-I and SA-II maintain the complicated spatial-temporal layout and enhance fidelity, while cross-attention control introduces editing guidance from the editing prompts.
  • Figure 5: Edited results. The edited videos demonstrate our method's effectiveness in terms of accuracy, fidelity, motion smoothness, and realism. Moreover, the edited videos illustrate superior harmony, seamlessly integrating the edited content into the original unedited environment and context.
  • ...and 4 more figures