MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Haoyu Zheng, Wenqiao Zhang, Zheqi Lv, Yu Zhong, Yang Dai, Jianxiang An, Yongliang Shen, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang

TL;DR

MAKIMA addresses open-domain multi-attribute video editing without fine-tuning by introducing mask-guided attention modulation and mutual spatial-temporal self-attention, which preserve video structure while applying attribute changes. The method uses DDIM inversion to reuse source-frame structure and features, and applies per-attribute masks to suppress attention leakage in both self- and cross-attention during denoising. A consistent feature propagation strategy edits keyframes and propagates their modulated features across the sequence to balance quality and efficiency. Experiments on diverse videos show improved editing accuracy and temporal coherence over tuning-free baselines, which supports its practical use for open-domain video editing without additional networks or fine-tuning.

Abstract

Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.
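
As a rough illustration of mask-guided attention modulation in self-attention, the sketch below adds a positive bias to attention logits between tokens that share an attribute mask and a negative bias between tokens that belong to different attributes. The function name `modulate_self_attention`, the additive-bias form, and the `boost` and `suppress` parameters are assumptions made for illustration; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def modulate_self_attention(q, k, v, labels, boost=1.0, suppress=1.0):
    """Self-attention with an additive, mask-derived bias (illustrative only).

    q, k, v : (n_tokens, dim) float arrays.
    labels  : (n_tokens,) integer attribute id per token; 0 means background.
    boost, suppress : hypothetical bias strengths, not values from the paper.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)

    same = labels[:, None] == labels[None, :]
    both_fg = (labels[:, None] > 0) & (labels[None, :] > 0)

    bias = np.zeros_like(logits)
    bias[same & both_fg] = boost         # strengthen intra-attribute pairs
    bias[~same & both_fg] = -suppress    # weaken inter-attribute pairs
    logits = logits + bias

    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: six tokens, two edited attributes (ids 1 and 2) plus background (id 0).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
labels = np.array([1, 1, 2, 2, 0, 0])
out, weights = modulate_self_attention(q, k, v, labels, boost=2.0, suppress=2.0)
```

With `boost` and `suppress` set to zero the function reduces to ordinary scaled dot-product attention, so the effect of the masks can be checked by comparing the two weight matrices.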

Paper Structure

This paper contains 16 sections, 16 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: MAKIMA achieves open-domain multi-attribute video editing while maintaining the structure of the source video without tuning.
  • Figure 2: (a) Failure cases of multi-attribute video editing by previous methods. MAKIMA achieves precise attribute modifications while preserving the structural composition of source frames. (b) Through Mask-guided Attention Modulation, MAKIMA aligns the attention distribution of different attributes with their corresponding spatial layouts in the source video.
  • Figure 3: MAKIMA pipeline. After performing DDIM inversion to obtain latent features and attention maps, we inflate the UNet for denoising with Mutual Spatial-Temporal Self-Attention. During denoising, we utilize pre-computed attribute masks to guide attention modulation: enhancing intra-attribute correlations while suppressing inter-attribute interference in self-attention, and controlling text-guided appearance transformation in cross-attention.
  • Figure 4: Qualitative comparison with baselines: Our method achieves precise attribute-specific editing while maintaining structural consistency with the original video frames.
  • Figure 5: User study results comparing different methods.
  • ...and 2 more figures
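
The cross-attention control mentioned in the Figure 3 caption can be pictured, under strong simplifying assumptions, as restricting each spatial token to the prompt tokens of its own attribute. The sketch below is a hypothetical illustration, not the paper's method: the function name `masked_cross_attention`, the hard masking with negative infinity, and the convention that prompt tokens with id 0 are shared by all regions are all assumptions.

```python
import numpy as np

def masked_cross_attention(q, k_text, v_text, pixel_attr, token_attr):
    """Cross-attention restricted by attribute masks (illustrative only).

    q              : (n_pixels, dim) spatial queries.
    k_text, v_text : (n_words, dim) prompt keys and values.
    pixel_attr     : (n_pixels,) attribute id per spatial token; 0 = background.
    token_attr     : (n_words,) attribute id per prompt token; 0 = shared token.
    """
    d = q.shape[-1]
    logits = q @ k_text.T / np.sqrt(d)

    # A spatial token may attend to shared prompt tokens and to the prompt
    # tokens that describe its own attribute.
    allowed = (token_attr[None, :] == 0) | (token_attr[None, :] == pixel_attr[:, None])
    if not allowed.any(axis=-1).all():
        raise ValueError("every spatial token needs at least one allowed prompt token")

    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)              # disallowed pairs become exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_text, weights

# Toy usage: four spatial tokens, five prompt tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k_text = rng.normal(size=(5, 8))
v_text = rng.normal(size=(5, 8))
pixel_attr = np.array([1, 1, 2, 0])
token_attr = np.array([0, 1, 1, 2, 2])
out, weights = masked_cross_attention(q, k_text, v_text, pixel_attr, token_attr)
```

Hard masking is the simplest choice for a sketch; a softer bias on the logits would be an equally plausible reading of "modulation".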