EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

Xiangpeng Yang; Linchao Zhu; Hehe Fan; Yi Yang

TL;DR

Benefiting from the precise attention weight distribution, EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping.

Abstract

Current diffusion-based video editing primarily focuses on local editing (e.g., object or background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue is the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage. To tackle this issue, we introduce EVA, a zero-shot, multi-attribute video editing framework tailored for human-centric videos with complex motions. We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that leverages the intrinsic positive and negative correspondences of cross-frame diffusion features. To avoid attention leakage, we use these correspondences to boost the attention scores of tokens within the same attribute across all video frames while limiting interactions between tokens of different attributes in the self-attention layer. For precise text-to-attribute manipulation, we use discrete text embeddings focused on specific layout areas within the cross-attention layer. Benefiting from the precise attention weight distribution, EVA generalizes easily to multi-object editing scenarios and achieves accurate identity mapping. Extensive experiments demonstrate that EVA achieves state-of-the-art results in real-world scenarios. Full results are provided at https://knightyxp.github.io/EVA/
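The two attention adjustments described in the abstract can be illustrated with a toy example. The sketch below is not the paper's formulation: it assumes each token already carries an attribute label taken from a layout mask, uses a fixed additive bias (the hypothetical parameter `strength`) in place of the positive/negative values EVA computes, and all function names are illustrative.

```python
import math

def softmax(row):
    # Numerically stable softmax; -inf entries receive exactly zero weight.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layout_guided_self_attention(logits, labels, strength=2.0):
    """Toy self-attention: raise scores for same-attribute pairs, lower the rest.

    logits[i][j] is the raw query-key score, labels[i] the attribute id of
    token i. `strength` is an assumed constant, not a value from the paper.
    """
    weights = []
    for i, row in enumerate(logits):
        biased = [
            s + strength if labels[i] == labels[j] else s - strength
            for j, s in enumerate(row)
        ]
        weights.append(softmax(biased))
    return weights

def layout_masked_cross_attention(logits, pixel_labels, token_labels):
    """Toy cross-attention: a pixel only sees text tokens of its own attribute.

    token_labels[k] is an attribute id, or None for tokens shared by every
    region (assumed here so that each row keeps at least one visible token).
    """
    weights = []
    for i, row in enumerate(logits):
        masked = [
            s if token_labels[k] in (None, pixel_labels[i]) else float("-inf")
            for k, s in enumerate(row)
        ]
        weights.append(softmax(masked))
    return weights

# Self-attention: tokens 0-1 belong to attribute 0, tokens 2-3 to attribute 1.
labels = [0, 0, 1, 1]
uniform = [[0.0] * 4 for _ in labels]  # plain softmax would give 0.25 everywhere
self_w = layout_guided_self_attention(uniform, labels)

# Cross-attention: text tokens are (shared start token, attribute 0, attribute 1).
token_labels = [None, 0, 1]
pixel_labels = [0, 1]
cross_logits = [[0.0, 1.0, 3.0], [0.0, 3.0, 1.0]]
cross_w = layout_masked_cross_attention(cross_logits, pixel_labels, token_labels)
```

In this toy setup the first token places almost all of its self-attention mass on the two tokens of its own attribute, and each pixel assigns zero cross-attention weight to the text token of the other attribute, even though that token had the highest raw score.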

Paper Structure (16 sections, 5 equations, 10 figures, 3 tables)

Introduction
Related Work
Text-to-Image Editing/Generation
Text-to-Video Editing
EVA
What is the Key to Multi-Attribute Video Editing?
Accurate Text-to-Attribute Control
Avoiding Attention Leakage
Overall Framework
Spatial-Temporal Layout-Guided Attention
Experiments
Experimental Settings
Results
Qualitative and Quantitative Comparisons
Ablation Study
...and 1 more section

Figures (10)

Figure 1: EVA achieves multi-attribute editing for both single and multi-object scenarios, adhering to the source video's layout and faithfully preserving motion information.
Figure 2: Failure cases of previous methods in single- and multi-object scenes. EVA's successful edits are shown in the third row of Fig. 1 (left) and the second row of Fig. 1 (right).
Figure 3: Intrinsic cross-frame DIFT [tang2023emergent] feature correspondence. We randomly select a "red point" in the source image, extract its DIFT feature, and compute its cosine similarity with the target image. The target's "red point" marks the highest similarity and the "blue point" the lowest, showing the potential to identify intra- and inter-attribute correspondences without supervision.
Figure 4: Left: FateZero [qi2023fatezero] fails in text-to-attribute control, incorrectly allocating weights to "snow" and not fully covering "man." Right: Although Ground-A-Video [jeong2023ground] attempts to ground each attribute individually, it still suffers from attention leakage, leading to texture blending on "Batman's" upper body and imprecise edits of "court" and "wall."
Figure 5: EVA pipeline. We integrate the ST-Layout Attn within the frozen SD in the denoising process. In the self-attention layer, we compute the positive/negative value of each query token in different attributes from a spatial-temporal perspective. This allows us to augment the attention scores for tokens within the same attribute and reduce them for tokens in different attributes. In the cross-attention layer, we extract each attribute's text embeddings from the edit prompt, ensuring they focus only on the corresponding layouts across frames.
...and 5 more figures
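
The correspondence test described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration with hand-written three-dimensional vectors standing in for DIFT features; the function names and the toy data are assumptions, and no feature extraction is performed here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_and_worst_match(query, target_features):
    """Return indices of the most and least similar target features.

    `query` plays the role of the source "red point" feature; each entry of
    `target_features` plays the role of one location in the target frame.
    """
    sims = [cosine(query, f) for f in target_features]
    best = max(range(len(sims)), key=sims.__getitem__)
    worst = min(range(len(sims)), key=sims.__getitem__)
    return best, worst, sims

# Toy stand-ins for per-location features (hypothetical values).
query = [1.0, 0.0, 0.5]
target = [
    [0.9, 0.1, 0.4],    # nearly parallel to the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -0.5],  # opposite direction
]
best, worst, sims = best_and_worst_match(query, target)
```

Here `best` picks the location whose feature is nearly parallel to the query (the analogue of the target's "red point") and `worst` picks the opposite-direction feature (the analogue of the "blue point").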
