RealCraft: Attention Control as A Tool for Zero-Shot Consistent Video Editing

Shutong Jin; Ruiyu Wang; Florian T. Pokorny

TL;DR

RealCraft is a zero-shot method for editing real videos through attention control, requiring no extra inputs, model fine-tuning, or parameter tuning. It swaps cross-attention maps to inject features from the editing prompt (CrossBlender) and relaxes spatial-temporal attention in the region of the edited object (SpatialBlender), enabling significant shape edits that stay temporally coherent across videos of up to 64 frames. The pipeline is built on a latent diffusion model: source frames are inverted with deterministic DDIM inversion, and the attention maps stored during inversion then guide denoising under the editing prompt. Quantitative and qualitative comparisons against six baselines show improved editing fidelity, background transformation, and pose preservation. The method points toward robust, prompt-driven editing of real videos, with multi-modal guidance as a possible extension for broader control over object motion and semantics.
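The deterministic inversion mentioned above can be illustrated with a minimal NumPy sketch of the standard DDIM update (eta = 0). This is not the paper's implementation: `toy_eps` is a hypothetical stand-in for the denoising network, and the `alphas` schedule is an illustrative assumption. Because the toy predictor ignores its input, inversion followed by sampling recovers the latent exactly; with a real network the round trip is only approximate.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Move a latent from cumulative-alpha level a_from to a_to (DDIM, eta = 0)."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def toy_eps(x, t):
    # Hypothetical stand-in for the noise-prediction network. It returns a
    # fixed pattern and ignores x and t, so inversion is exactly reversible.
    return np.full_like(x, 0.1)

# Illustrative cumulative-alpha schedule, ordered from clean to noisy (assumption).
alphas = np.linspace(0.999, 0.05, 10)

def invert(x0):
    """Run the deterministic update forward in noise level (clean -> noisy)."""
    x = x0
    for i in range(len(alphas) - 1):
        x = ddim_step(x, toy_eps(x, i), alphas[i], alphas[i + 1])
    return x

def sample(xT):
    """Run the same update backward in noise level (noisy -> clean)."""
    x = xT
    for i in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, toy_eps(x, i), alphas[i], alphas[i - 1])
    return x
```

In an editing pipeline of this kind, the inverted latent is the shared starting point for denoising, which is why a reversible, noise-free update is preferred over stochastic sampling.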

Abstract

Even though large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying these models directly to image editing remains a significant challenge. This challenge is further amplified in video editing due to the additional dimension of time. This is especially the case for editing real-world videos, as it requires maintaining a stable structural layout across frames while executing localized edits without disrupting the existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention for new feature injection and relaxing spatial-temporal attention of the edited object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our model directly uses Stable Diffusion and operates without the need for additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
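As a rough illustration of what swapping cross-attention can look like, the sketch below follows a prompt-to-prompt-style word swap: attention maps of tokens shared by both prompts are taken from the source branch, while maps of changed tokens come from the editing branch. The function name, the array layout, and the assumption that the two prompts align token by token are all hypothetical; the abstract does not specify CrossBlender's exact rule.

```python
import numpy as np

def swap_cross_attention(src_attn, edit_attn, shared_tokens):
    """Replace shared-token attention maps in the edit branch with source maps.

    src_attn, edit_attn: arrays of shape (frames, pixels, tokens).
    shared_tokens: boolean sequence of length `tokens`, True where a token is
    unchanged between the source and editing prompts (assumed aligned).
    """
    shared = np.asarray(shared_tokens, dtype=bool)
    # Broadcast the per-token flag over frames and pixels.
    return np.where(shared[None, None, :], src_attn, edit_attn)
```

For a hypothetical prompt pair such as "a man wearing a helmet" and "a man wearing a beret", only the last token differs, so `shared_tokens` would be `[True, True, True, True, False]`: the layout attended to by the unchanged words is preserved and only the new word injects new features.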

Paper Structure (26 sections, 9 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Related Work
Diffusion Model for Text-driven Video Editing
Attention Control for Image and Video Editing
Methodology
Preliminary
Latent Diffusion Models.
DDIM Inversion and Sampling.
Attentions within RealCraft
Attention Control Module in RealCraft
CrossBlender
SpatialBlender
Algorithm
Experiments
Implementation Details
...and 11 more sections

Figures (7)

Figure 1: RealCraft enables zero-shot, shape-wise, consistent editing for real videos. Our method performs edits using Stable Diffusion, with text as the only input. No extra training or fine-tuning of models, structural guidance or parameter tuning is required.
Figure 2: (a) Our proposed RealCraft pipeline takes source frames $\{x_{i}\}_{i=1}^{n}$ ($n = 8$ in this illustration), a source prompt, and an editing prompt as inputs. Initially, $\{x_{i}\}_{i=1}^{n}$ are encoded into latent space by a VAE encoder (Kingma & Welling, 2013), followed by DDIM inversion to obtain the inverted latents, while storing spatial-temporal and cross-attention maps. In the denoising stage, the stored attention maps are fed into the Attention Control Module, orchestrating the spatial-temporal (SpatialBlender) and cross-attention (CrossBlender) for video editing. (b) Illustrations of spatial-temporal attention, cross-attention, and temporal attention, with different colors representing the QKV components. Cross-attention occurs between the encoded prompt and frame. (c) The proposed Attention Control Module comprises CrossBlender and SpatialBlender.
Figure 3: A demonstration of the impact of blending threshold $\tau$ on blending mask.
Figure 4: Qualitative comparison with other baselines in background transformation.
Figure 5: Qualitative comparison with other baselines in shape editing: (a) boat $\rightarrow$ kayak and hill $\rightarrow$ forest; (b) helmet $\rightarrow$ beret and road $\rightarrow$ grass, and in pose preservation: (c) bear $\rightarrow$ lion; (d) blackswan $\rightarrow$ flamingo.
...and 2 more figures
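The role of the blending threshold $\tau$ from Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the per-frame max normalization, the `>=` comparison, and the function names `blending_mask` and `blend` are hypothetical choices. The idea is that an attention map for the edited token is binarized at $\tau$, and the mask then decides where edited features replace the stored source features.

```python
import numpy as np

def blending_mask(token_attn, tau):
    """Binarize a per-frame attention map for the edited token.

    token_attn: non-negative array of shape (frames, height, width).
    Each frame is scaled by its own maximum before thresholding (assumption).
    """
    peak = token_attn.max(axis=(1, 2), keepdims=True)
    normalized = token_attn / np.maximum(peak, 1e-8)
    return normalized >= tau

def blend(src_feat, edit_feat, mask):
    """Keep source features outside the mask and edited features inside it."""
    return np.where(mask, edit_feat, src_feat)
```

Under this sketch, raising $\tau$ can only shrink the mask, so a higher threshold confines the edit to the most strongly attended pixels while a lower one lets it spread into the surrounding region.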
