Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing

Mingce Guo, Jingxuan He, Shengeng Tang, Zhangye Wang, Lechao Cheng

TL;DR

The paper tackles the limited expressiveness of word embeddings and attention instability in text-driven video editing with diffusion models. It introduces Concept-Augmented Textual Inversion (CATI), which uses LoRA to adapt value projections for external concept videos, and Dual Prior Supervision (DPS) to constrain cross-attention during editing. Through a two-stage training and inference scheme that blends self-attention and swaps cross-attention, the method achieves improved frame consistency, non-target area stability, and concept fidelity, outperforming state-of-the-art baselines. The approach enables flexible, one-shot editing of videos with stylized results while maintaining temporal and spatial coherence, expanding practical applications in film, art, and advertising.
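As a rough illustration of what "using LoRA to adapt value projections" can mean, the sketch below adds a trainable low-rank update to a frozen value projection matrix. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the names `lora_value_projection`, `lora_a`, `lora_b`, and `scale`, and the toy shapes, are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_value_projection(tokens, w_v, lora_a, lora_b, scale=1.0):
    """Compute tokens @ (W_v + scale * A @ B) without forming the sum explicitly.

    tokens: (n, d_in) text-token embeddings
    w_v:    (d_in, d_out) frozen, pre-trained value projection
    lora_a: (d_in, r) and lora_b: (r, d_out) trainable low-rank factors
    All names and shapes here are illustrative assumptions.
    """
    base = matmul(tokens, w_v)                       # frozen path
    delta = matmul(matmul(tokens, lora_a), lora_b)   # low-rank trainable path
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


# Toy example: one token, identity value projection, rank-1 update.
tokens = [[1.0, 2.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0], [0.0]]
lora_b_zero = [[0.0, 0.0]]   # zero-initialised B leaves the frozen output unchanged
lora_b = [[0.5, -0.5]]       # a non-zero B shifts the value vector

unchanged = lora_value_projection(tokens, w_v, lora_a, lora_b_zero)
adapted = lora_value_projection(tokens, w_v, lora_a, lora_b)
```

With `lora_b` initialised to zero the adapted layer reproduces the pre-trained output, which is the usual reason LoRA starts training from that point; only the two small factors would be updated when fitting a concept video.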

Abstract

Text-driven video editing with generative diffusion models has garnered significant attention due to its potential applications. However, existing approaches are constrained by the limited word embeddings provided in pre-training, which hinders nuanced editing that targets open concepts with specific attributes. Directly altering the keywords in target prompts often results in unintended disruptions to the attention mechanisms. To make editing more flexible, this work proposes an improved concept-augmented video editing approach that generates diverse and stable target videos by devising abstract conceptual pairs. Specifically, the framework involves concept-augmented textual inversion and a dual prior supervision mechanism. The former enables plug-and-play guidance of Stable Diffusion for video editing, effectively capturing target attributes for more stylized results. The latter significantly enhances video stability and fidelity. Comprehensive evaluations demonstrate that our approach generates more stable and lifelike videos, outperforming state-of-the-art methods.

Paper Structure

This paper contains 12 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our training and inference pipelines. During the training stage, we first adapt the diffusion model to new visual concepts using our introduced Concept-Augmented Textual Inversion (CATI), and then we tune the temporally extended diffusion model with our proposed Dual Prior Supervision (DPS) mechanism to prevent unintended changes in edited videos. During the inference stage, we blend self-attention matrices (Self-Attention Blending) and swap cross-attention matrices (Cross-Attention Swap) to achieve stable video editing.
  • Figure 2: Visualization of the dual prior supervision mechanism. Each row displays a video frame, a set of cross-attention maps between this video frame and prompt words, and a pseudo ground truth mask. The scam loss and tcam loss are computed between relevant words and pseudo masks to reduce unintended changes.
  • Figure 3: Video generation with (Setting I) and without (Setting II) concept pairs. The top row shows the concept video with its prompt. The second row shows the source video frames together with the prompts to be edited. The rows below show the results of editing the source video with the editing prompt for Tune-A-Video (wu2023tune), FateZero (qi2023fatezero), MotionDirector (zhao2023motiondirector), RAVE (kara2024rave), and our method, respectively. Words prefixed with "$" are concept words, here and in subsequent results.
  • Figure 4: Comparison of textual inversion with and without Concept Augmentation (CA). We compare the textual inversion results without and with concept augmentation for pairs (a), (b): 'jeep' $\to$ '$LAMBO'; and (c), (d): 'jeep' $\to$ '$CYBERTRUCK', respectively, all from the same source prompt "a jeep driving down a curvy road in the countryside".
  • Figure 5: The impact of dual prior supervision. From the first to the last row, using the editing example in Fig. 1, we compare the average cross-attention maps and the editing results with and without the supervision mechanism of scam and tcam. Each case contains three pairs, and each pair consists of an average cross-attention map on the left and an edited frame on the right.
  • ...and 1 more figures
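
The captions for Figures 1 and 2 mention blending self-attention matrices and computing losses between cross-attention maps and pseudo ground truth masks, but this page does not give the formulas. The sketch below shows one plausible reading of each idea on flattened attention maps. It is an assumption for illustration only: `blend_attention` and `outside_mask_penalty` are hypothetical helpers, and the penalty is not the paper's scam or tcam loss.

```python
def blend_attention(src, edit, mask):
    """Per-position blend: take `edit` where mask is 1 and `src` where mask is 0.

    src, edit: flattened attention values from the source and edited branches
    mask:      values in [0, 1] marking the region that is allowed to change
    """
    return [m * e + (1.0 - m) * s for s, e, m in zip(src, edit, mask)]


def outside_mask_penalty(attn, mask):
    """Fraction of a word's attention mass that falls outside the pseudo mask.

    A value near 0 means the word attends almost only to the masked region,
    which is the behaviour a supervision term of this kind would encourage.
    """
    total = sum(attn)
    outside = sum(a * (1.0 - m) for a, m in zip(attn, mask))
    return outside / total if total > 0 else 0.0


# Toy 3-pixel example: only the last two pixels belong to the target object.
mask = [0.0, 1.0, 1.0]
src_attn = [0.25, 0.5, 0.25]
edit_attn = [0.5, 0.25, 0.25]

blended = blend_attention(src_attn, edit_attn, mask)
penalty = outside_mask_penalty(src_attn, mask)
```

In this toy case the first pixel keeps its source attention value because it lies outside the mask, which is the sense in which blending can protect non-target areas from unintended changes.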