VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

TL;DR

VideoGrain tackles the challenge of zero-shot multi-grained video editing across class, instance, and part levels by introducing Spatial-Temporal Layout-Guided Attention (ST-Layout Attn) to modulate both cross- and self-attention in diffusion-based video editing. The method enhances text-to-region control by biasing cross-attention toward region-specific prompts and enforces feature separation via self-attention modulation, reducing inter-region interference. It operates without additional training and supports ControlNet conditioning, achieving state-of-the-art performance on real videos and multiple editing granularities, including occluded scenes and background preservation. Practically, VideoGrain offers flexible, high-fidelity video editing with improved temporal consistency and regional specificity, broadening applicability in video retouching and content creation while acknowledging ethical considerations and potential misuse.

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
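
The cross-attention idea in the abstract (strengthen each local prompt's attention inside its own region, weaken it elsewhere) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `modulate_cross_attention`, the constant additive bias `strength`, and the list-based layout encoding are all assumptions made for illustration, and the paper's actual modulation terms may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(scores, token_region, pixel_region, strength=2.0):
    """Bias raw cross-attention scores with a region layout (hypothetical helper).

    scores[q][k]    raw score between pixel query q and text token k
    token_region[k] region id targeted by token k's local prompt,
                    or None if the token is not tied to any region
    pixel_region[q] region id of pixel q, taken from a layout mask
    strength        assumed constant bias; a placeholder, not the paper's value
    """
    attn = []
    for q, row in enumerate(scores):
        biased = []
        for k, s in enumerate(row):
            region = token_region[k]
            if region is None:
                biased.append(s)             # unconstrained token: unchanged
            elif region == pixel_region[q]:
                biased.append(s + strength)  # positive pair: prompt and its region
            else:
                biased.append(s - strength)  # negative pair: prompt and outside area
        attn.append(softmax(biased))
    return attn
```

With uniform raw scores, a pixel in region 0 ends up attending most to the token bound to region 0 and least to the token bound to another region, which is the text-to-region behaviour the abstract describes.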

Paper Structure

This paper contains 27 sections, 7 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: VideoGrain enables multi-grained video editing across class, instance, and part levels.
  • Figure 2: Definition of multi-grained video editing and comparison on instance editing
  • Figure 2: Efficiency comparison.
  • Figure 3: Analysis of why the diffusion model fails at instance-level video editing. The goal is to edit the left man into "Iron Man," the right man into "Spiderman," and the trees into "cherry blossoms." In (b), we apply K-Means to the self-attention features, and in (d), we visualize the 32x32 cross-attention map.
  • Figure 4: VideoGrain pipeline. (1) We integrate ST-Layout Attn into the frozen SD for multi-grained editing, modulating self- and cross-attention in a unified manner. (2) In cross-attention, we treat each local prompt and its location as a positive pair, and the prompt and the areas outside that location as negative pairs, enabling text-to-region control. (3) In self-attention, we enhance positive awareness within each region and restrict negative interactions between regions across frames, so that each query attends only to its target region and features stay separated. In the bottom two figures, $p$ denotes the original attention score, and $w$ and $i$ denote the word and frame indices.
  • ...and 13 more figures
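
The self-attention step in the Figure 4 caption (raise intra-region scores, suppress inter-region scores across frames) can be sketched in the same spirit. Again, this is an illustration under stated assumptions rather than the paper's code: `modulate_self_attention`, the constant `strength`, and the flattening of all frames into one token sequence are hypothetical choices.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_self_attention(scores, region_ids, strength=2.0):
    """Bias space-time self-attention with per-token region ids (hypothetical helper).

    scores[q][k]   raw score between query token q and key token k, where
                   tokens from all frames are flattened into one sequence
    region_ids[t]  region id of token t, from a per-frame layout mask
    strength       assumed constant bias; a placeholder, not the paper's value
    """
    attn = []
    for q, row in enumerate(scores):
        biased = [
            # Same region (in any frame): positive pair, otherwise negative pair.
            s + strength if region_ids[k] == region_ids[q] else s - strength
            for k, s in enumerate(row)
        ]
        attn.append(softmax(biased))
    return attn


# Two frames with two tokens each; tokens 0 and 2 share region 0,
# tokens 1 and 3 share region 1.
region_ids = [0, 1, 0, 1]
raw = [[0.0] * 4 for _ in range(4)]
attn = modulate_self_attention(raw, region_ids)
```

Because the bias depends only on region membership, a query gives equal extra weight to same-region tokens in its own frame and in other frames, which is one simple way to read "across frames" in the caption.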