
Blended Latent Diffusion under Attention Control for Real-World Video Editing

Deyin Liu, Lin Yuanbo Wu, Xianghua Xie

TL;DR

This work tackles local video editing with image-based diffusion models, addressing three problems: background preservation, mask generation, and temporal consistency. It proposes Blended Latent Diffusion under Attention Control, which combines DDIM inversion to obtain deterministic background latents, autonomous masking by thresholding cross-attention maps, and temporal-spatial attention to enforce inter-frame coherence, all without additional training. The key contributions are a DDIM-based background latent strategy, an online masking mechanism derived from cross-attention maps, and a training-free temporal-spatial attention module that preserves motion and appearance across frames. The approach supports real-world video edits such as attribute changes and object category replacement, using publicly available diffusion priors.
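The masking and blending ideas summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the min-max normalization, and the threshold value are assumptions made for illustration, and the toy arrays stand in for real cross-attention maps and latents.

```python
import numpy as np

def attention_to_mask(attn_map, threshold=0.3):
    """Min-max normalize a 2D attention map and binarize it.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    attn = np.asarray(attn_map, dtype=np.float64)
    span = attn.max() - attn.min()
    if span == 0:
        # A flat map carries no localization signal: edit nothing.
        return np.zeros_like(attn)
    normalized = (attn - attn.min()) / span
    return (normalized >= threshold).astype(np.float64)

def blend_latents(edited, background, mask):
    """Keep edited latents inside the mask and background latents outside.

    edited, background: arrays of shape (C, H, W); mask: shape (H, W).
    """
    m = mask[None]  # broadcast the mask over the channel axis
    return m * edited + (1.0 - m) * background

# Toy data: a 2x2 attention map that peaks in the bottom-right cell.
attn = np.array([[0.0, 0.1], [0.2, 1.0]])
mask = attention_to_mask(attn, threshold=0.5)
edited = np.ones((4, 2, 2))
background = np.zeros((4, 2, 2))
blended = blend_latents(edited, background, mask)
```

In this toy case only the bottom-right location takes the edited latent; every other location keeps the background latent unchanged.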

Abstract

Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models. However, they still face major challenges in local editing of video with temporal information. First, although existing methods attempt to restrict editing to a local area with a pre-defined mask, the background outside that area is poorly preserved because each frame is generated in its spatial entirety. In addition, requiring the user to provide a mask is an extra cost, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, an image-level pre-trained model has not learned temporal information across the frames of a video, which is vital for expressing motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire background latents, instead of randomly noised ones, to better preserve the background information of the input video. We further introduce an autonomous mask generation mechanism derived from cross-attention maps in the diffusion steps. Finally, we enhance temporal consistency across video frames by transforming the self-attention blocks of the U-Net into temporal-spatial blocks. Extensive experiments demonstrate the effectiveness of the proposed approach on different real-world video editing tasks.
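To make the DDIM inversion step more concrete, here is a small NumPy sketch of the standard deterministic DDIM update run in both directions. It is a sketch under stated assumptions: `toy_eps_model`, the `alphas` schedule, and the latent shape are hypothetical stand-ins for a real noise-prediction U-Net and its schedule, and storing the whole trajectory is only one plausible way to look up a background latent at each noise level.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(latent, eps_model, alphas):
    """Walk from the clean latent toward noise, keeping every intermediate latent."""
    trajectory = [latent]
    for t in range(len(alphas) - 1):
        eps = eps_model(latent, t)
        latent = ddim_step(latent, eps, alphas[t], alphas[t + 1])
        trajectory.append(latent)
    return trajectory

def ddim_sample(latent, eps_model, alphas):
    """Walk from the noisiest level back to the clean latent."""
    for t in range(len(alphas) - 1, 0, -1):
        eps = eps_model(latent, t)
        latent = ddim_step(latent, eps, alphas[t], alphas[t - 1])
    return latent

# Hypothetical setup: a fixed noise prediction replaces the real network.
rng = np.random.default_rng(0)
fixed_eps = rng.standard_normal((4, 8, 8))

def toy_eps_model(latent, t):
    # Assumption: ignores its inputs, so the round trip below is exact.
    return fixed_eps

alphas = np.linspace(0.999, 0.1, 10)  # decreasing cumulative alphas (assumed)
x0 = rng.standard_normal((4, 8, 8))
trajectory = ddim_invert(x0, toy_eps_model, alphas)
recovered = ddim_sample(trajectory[-1], toy_eps_model, alphas)
```

With a real network the noise prediction depends on the latent, so inversion followed by sampling reconstructs the input only approximately. The point of the sketch is that the inverted latents are a deterministic function of the input frame, unlike latents produced by adding random noise.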

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Examples of local video editing achieved by our proposed method
  • Figure 2: Performance comparisons between using the DDIM inverted latents and previous randomly noised ones
  • Figure 3: Performance comparisons with/without the proposed temporal-spatial attention
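The temporal-spatial attention compared in Figure 3 can be sketched roughly as follows. This section does not say which frames each query attends to, so the choice here (the frame's own tokens plus the tokens of a reference frame, by default the first) is an assumption, as are the function names and tensor shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_spatial_attention(q, k, v, ref_frames=(0,)):
    """Attention in which each frame also attends to reference frames.

    q, k, v: arrays of shape (frames, tokens, dim).
    ref_frames: indices of frames whose keys and values are shared with
    every frame (hypothetical choice, not specified in the source).
    """
    num_frames, _, dim = q.shape
    out = np.empty_like(q)
    for f in range(num_frames):
        frame_ids = list(ref_frames) + [f]
        keys = np.concatenate([k[i] for i in frame_ids], axis=0)
        values = np.concatenate([v[i] for i in frame_ids], axis=0)
        weights = softmax(q[f] @ keys.T / np.sqrt(dim))
        out[f] = weights @ values
    return out

# Toy input: 3 frames, 5 spatial tokens per frame, 8 channels.
rng = np.random.default_rng(0)
q = rng.standard_normal((3, 5, 8))
k = rng.standard_normal((3, 5, 8))
v = rng.standard_normal((3, 5, 8))
shared = temporal_spatial_attention(q, k, v)                   # with reference frame 0
per_frame = temporal_spatial_attention(q, k, v, ref_frames=())  # plain self-attention
```

Because only the key and value sets change, a sketch like this reuses the existing projection weights of a self-attention block, which is consistent with the training-free framing of the method.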