RealCraft: Attention Control as A Tool for Zero-Shot Consistent Video Editing

Shutong Jin, Ruiyu Wang, Florian T. Pokorny

TL;DR

RealCraft tackles zero-shot editing of real videos with an attention-control pipeline that requires no extra inputs and no model fine-tuning. Built on a latent diffusion model with deterministic DDIM inversion, it runs a two-step attention-control loop guided by a fixed editing prompt: it swaps cross-attention maps to inject the editing prompt (CrossBlender) and relaxes spatial-temporal attention in feature-heavy areas (SpatialBlender). Together these steps enable significant shape edits with strong temporal coherence in videos of up to 64 frames, without parameter tuning. Quantitative and qualitative evaluations against six baselines show improved editing fidelity, background transformation, and pose preservation. The results point toward robust, prompt-driven editing of real videos, with multi-modal guidance as a possible future extension for broader control over object motion and semantics.
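The deterministic inversion mentioned above can be illustrated with the standard DDIM update. The sketch below is a toy under stated assumptions, not RealCraft's implementation: latents are single floats, `abar` is a hypothetical cumulative noise schedule, and `eps_model` stands in for the denoising network.

```python
import math
from typing import Callable, Sequence

# Hypothetical stand-in for the noise-prediction network: (latent, timestep) -> noise.
EpsModel = Callable[[float, int], float]


def ddim_step(x: float, eps: float, abar_from: float, abar_to: float) -> float:
    """One deterministic DDIM move between two noise levels (eta = 0)."""
    # Predict the clean latent implied by the current latent and the noise estimate.
    x0_pred = (x - math.sqrt(1.0 - abar_from) * eps) / math.sqrt(abar_from)
    # Re-noise the prediction to the target noise level with the same estimate.
    return math.sqrt(abar_to) * x0_pred + math.sqrt(1.0 - abar_to) * eps


def ddim_invert(x0: float, abar: Sequence[float], eps_model: EpsModel) -> float:
    """Walk from the clean latent (index 0) to the noisiest level (last index)."""
    x = x0
    for t in range(len(abar) - 1):
        x = ddim_step(x, eps_model(x, t), abar[t], abar[t + 1])
    return x


def ddim_denoise(x_T: float, abar: Sequence[float], eps_model: EpsModel) -> float:
    """Walk back from the noisiest level to the clean latent."""
    x = x_T
    for t in range(len(abar) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), abar[t], abar[t - 1])
    return x


# Toy setup (assumed values): a short schedule and a constant noise predictor.
toy_abar = [0.999, 0.9, 0.6, 0.3]
toy_eps = lambda x, t: 0.25

inverted = ddim_invert(0.8, toy_abar, toy_eps)
recovered = ddim_denoise(inverted, toy_abar, toy_eps)
```

With the constant toy predictor the round trip recovers the starting latent up to floating-point error. With a real network the noise estimate depends on the latent, so inversion is only approximate.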

Abstract

Even though large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying these models directly to image editing remains a significant challenge. This challenge is further amplified in video editing due to the additional dimension of time. This is especially the case for editing real-world videos as it necessitates maintaining a stable structural layout across frames while executing localized edits without disrupting the existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention for new feature injection and relaxing spatial-temporal attention of the editing object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our model directly uses Stable Diffusion and operates without the need for additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
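To make "swapping cross-attention for new feature injection" concrete, here is a minimal sketch of a generic token-level swap in the style of prompt-to-prompt editing. It is an illustration under assumptions rather than the paper's CrossBlender: the function name, the list-of-lists map layout (rows are spatial positions, columns are prompt tokens), and the `changed_tokens` argument are all hypothetical.

```python
from typing import Iterable, List

# Rows: spatial positions in a frame. Columns: prompt tokens.
AttentionMap = List[List[float]]


def swap_cross_attention(
    source_map: AttentionMap,
    edit_map: AttentionMap,
    changed_tokens: Iterable[int],
) -> AttentionMap:
    """Keep source attention for unchanged tokens; take edit attention for changed ones."""
    if len(source_map) != len(edit_map):
        raise ValueError("maps must cover the same spatial positions")
    changed = set(changed_tokens)
    blended: AttentionMap = []
    for src_row, edit_row in zip(source_map, edit_map):
        if len(src_row) != len(edit_row):
            raise ValueError("maps must cover the same number of tokens")
        # Column j comes from the edit branch only when token j was replaced.
        blended.append(
            [edit_row[j] if j in changed else src_row[j] for j in range(len(src_row))]
        )
    return blended


# Toy example: three tokens, token 1 is the replaced word (e.g. "boat" -> "kayak").
source = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
edit = [[0.5, 0.4, 0.1], [0.2, 0.6, 0.2]]
blended = swap_cross_attention(source, edit, changed_tokens=[1])
```

Reusing the source columns for unchanged tokens is what keeps the layout of the untouched content stable. This sketch does not renormalise rows after the swap.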

Paper Structure (26 sections, 9 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: RealCraft enables zero-shot, shape-wise, consistent editing for real videos. Our method performs edits using Stable Diffusion, with text as the only input. No extra training or fine-tuning of models, structural guidance or parameter tuning is required.
  • Figure 2: (a) Our proposed RealCraft pipeline takes source frames $\{{x}_{i}\}_{i=1}^{n}$ (n = 8 in this illustration), a source prompt, and an editing prompt as inputs. Initially, $\{{x}_{i}\}_{i=1}^{n}$ are encoded into latent space by a VAE encoder [kingma2013auto], followed by DDIM inversion to obtain the inverted latents, while storing spatial-temporal and cross-attention maps. In the denoising stage, the stored attention maps are fed into the Attention Control Module, which orchestrates the spatial-temporal attention (SpatialBlender) and cross-attention (CrossBlender) for video editing. (b) Illustrations of spatial-temporal attention, cross-attention, and temporal attention, with different colors representing the QKV components. Cross-attention occurs between the encoded prompt and frame. (c) The proposed Attention Control Module comprises CrossBlender and SpatialBlender.
  • Figure 3: A demonstration of the impact of blending threshold $\tau$ on blending mask.
  • Figure 4: Qualitative comparison with other baselines in background transformation.
  • Figure 5: Qualitative comparison with other baselines in shape editing, (a) $\text{boat} \rightarrow \text{kayak}$ and $\text{hill} \rightarrow \text{forest}$, (b) $\text{helmet} \rightarrow \text{beret}$ and $\text{road} \rightarrow \text{grass}$; and in pose preservation, (c) $\text{bear} \rightarrow \text{lion}$, (d) $\text{blackswan} \rightarrow \text{flamingo}$.
  • ...and 2 more figures
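
The Figure 3 caption refers to a blending threshold $\tau$ that shapes a blending mask. One simple way such a mask could be built is sketched below, assuming it comes from thresholding a min-max normalised attention map. That construction, and the names `blending_mask` and `blend`, are assumptions for illustration; the captions do not specify how RealCraft computes its mask.

```python
from typing import List

Grid = List[List[float]]


def blending_mask(attention: Grid, tau: float) -> List[List[int]]:
    """Binary mask: 1 where the min-max normalised attention is at least tau."""
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A flat map carries no spatial preference, so nothing is selected.
        return [[0 for _ in row] for row in attention]
    return [[1 if (v - lo) / span >= tau else 0 for v in row] for row in attention]


def blend(source: Grid, edit: Grid, mask: List[List[int]]) -> Grid:
    """Take edited values inside the mask and source values outside it."""
    return [
        [e if m else s for s, e, m in zip(src_row, edit_row, mask_row)]
        for src_row, edit_row, mask_row in zip(source, edit, mask)
    ]


# Toy 2x2 attention map (assumed values): a higher tau selects a smaller region.
attn = [[0.0, 0.2], [0.6, 1.0]]
loose = blending_mask(attn, tau=0.1)
tight = blending_mask(attn, tau=0.9)
```

In this sketch, raising `tau` shrinks the region taken from the edited branch, and lowering it lets the edit spread into more of the frame.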