FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing

Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang

TL;DR

The paper addresses artifacts in zero-shot video editing caused by treating cross-attention masks as uniformly precise. It introduces Mask Matching Cost (MMC), with layer-wise ($LMMC$) and timestep-wise ($TMMC$) variants, to quantify mask quality and guide semantic-adaptive mask selection. The proposed FreeMask framework applies MMC-selected masks to comprehensive masked fusion across temporal, cross, and self-attention, enabling adaptive, task-specific precision without additional supervision or tuning. Extensive experiments across stylization, attribute, and shape editing demonstrate superior semantic fidelity and temporal coherence compared to state-of-the-art methods, and the approach generalizes across multiple text-to-video models. The work highlights a practical, training-free path to robust zero-shot video editing by systematically leveraging attention mask variability rather than relying on static or external masks.

Abstract

Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose a metric, Mask Matching Cost (MMC), that quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism across comprehensive attention features, e.g., the temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.

Paper Structure (23 sections, 12 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: (a) Visualization of the 'jeep' cross-attention maps across layers and denoising timesteps on Zeroscope. (b) TMMC of different models across timesteps. (c) LMMC of different models across layers.
  • Figure 2: FreeMask overview. FreeMask takes source video $\mathbf{X_0}$ and text prompt $P_0$ as input. During preprocessing, it stores cross-attention maps for each timestep across all videos in the DAVIS testing dataset to calculate LMMC and TMMC. In the inference stage, $\mathbf{X_0}$ and $P_0$ are input to DDIM inversion, storing attention features at each timestep and collecting the final latent output as the initial latent for denoising. Before denoising, masks $\mathbf{M^*}$ are adaptively promoted. During denoising, attention features are blended using masks $\mathbf{M^*}$. The final latent output $\mathbf{Z_0^*}$ is then decoded to produce the edited video.
  • Figure 3: Comparison results with several state-of-the-art approaches on three distinct tasks: stylization, attribute editing, and shape editing.
  • Figure 4: Ablation experiments. The experiments use Zeroscope as the base model and are conducted on a shape-editing task that changes a Jeep into a Porsche car. In these experiments, (a) is the original input video; (b) shows results using the latent output from DDIM inversion as the initial latent; (c) represents the fusion of self-attention based on (b); (d) involves the fusion of masked self-attention based on (b). Similar logic applies to other sub-captions.
  • Figure 5: Extension results on shape editing.
  • ...and 6 more figures
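
The Figure 2 caption says that, during denoising, attention features are blended using the masks $\mathbf{M^*}$. The sketch below illustrates one plausible reading of that step and is not the authors' implementation. It assumes that a flattened cross-attention map is min-max normalized and thresholded into a binary mask, and that the mask keeps the editing-branch feature inside the masked region and the stored source (inversion) feature elsewhere. The function names `binarize` and `masked_blend`, the threshold of 0.5, and the direction of the blend are all hypothetical; the MMC-based mask selection itself is not reproduced here.

```python
from typing import List


def binarize(attn_map: List[float], threshold: float = 0.5) -> List[int]:
    """Min-max normalize a flattened attention map, then threshold to 0/1.

    The threshold value is an assumption made for illustration only.
    """
    lo, hi = min(attn_map), max(attn_map)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial information: return an empty mask.
        return [0] * len(attn_map)
    return [1 if (a - lo) / span >= threshold else 0 for a in attn_map]


def masked_blend(src: List[float], edit: List[float], mask: List[int]) -> List[float]:
    """Blend per-location features: edit where mask == 1, source elsewhere.

    Which branch fills which region is an assumption, not taken from the paper.
    """
    if not (len(src) == len(edit) == len(mask)):
        raise ValueError("src, edit, and mask must have the same length")
    return [m * e + (1 - m) * s for s, e, m in zip(src, edit, mask)]


# Toy example with four spatial locations.
attn = [0.0, 1.0, 0.25, 0.75]           # hypothetical cross-attention scores
mask = binarize(attn)                    # [0, 1, 0, 1]
source_feat = [1.0, 2.0, 3.0, 4.0]       # features stored during inversion
edit_feat = [10.0, 20.0, 30.0, 40.0]     # features from the editing branch
blended = masked_blend(source_feat, edit_feat, mask)
print(mask, blended)                     # [0, 1, 0, 1] [1.0, 20.0, 3.0, 40.0]
```

In a real pipeline the same blend would be applied to tensors from the temporal, cross-, and self-attention modules rather than to flat lists. Plain lists are used here only to keep the example dependency-free.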