VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Xiangpeng Yang; Linchao Zhu; Hehe Fan; Yi Yang

TL;DR

VideoGrain addresses zero-shot multi-grained video editing at the class, instance, and part levels by introducing Spatial-Temporal Layout-Guided Attention (ST-Layout Attn), which modulates both cross- and self-attention in diffusion-based video editing. Cross-attention is biased so that each local prompt attends to its own region, strengthening text-to-region control, while self-attention is modulated to keep the features of different regions separate, reducing inter-region interference. The method requires no additional training and supports ControlNet conditioning. It reports state-of-the-art performance on real videos across multiple editing granularities, including occluded scenes and background preservation. In practice, VideoGrain offers flexible, high-fidelity editing with improved temporal consistency and regional specificity, which broadens its use in video retouching and content creation; the authors also acknowledge ethical considerations and potential misuse.

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
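To make the idea of attention modulation concrete, here is a minimal NumPy sketch of biasing attention logits with a region layout. It is an illustration only, not the authors' implementation: the function name `modulated_attention`, the fixed additive `strength`, and the toy region layout are all assumptions introduced here. The same routine covers both cases described above: for self-attention the mask marks query-key pairs that lie in the same region, and for cross-attention it marks pixel-token pairs where the token belongs to the prompt assigned to that pixel's region.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def modulated_attention(q, k, v, pair_mask, strength=4.0):
    """Attention with an additive layout bias (hypothetical helper).

    q: (Nq, d) queries, k: (Nk, d) keys, v: (Nk, dv) values.
    pair_mask: (Nq, Nk) bool, True where the query and key belong together
    (same region, or a pixel and a token of its region's prompt).
    strength: assumed constant bias; matching pairs get +strength,
    non-matching pairs get -strength before the softmax.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    bias = np.where(pair_mask, strength, -strength)
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights


# Toy layout: four pixels, the first two in region 0, the last two in region 1.
rng = np.random.default_rng(0)
pixel_region = np.array([0, 0, 1, 1])

# Self-attention: pixels should mostly attend within their own region.
x = rng.normal(size=(4, 8))
self_mask = pixel_region[:, None] == pixel_region[None, :]
self_out, self_w = modulated_attention(x, x, x, self_mask)

# Cross-attention: two text tokens, one per region-specific prompt.
token_region = np.array([0, 1])
text = rng.normal(size=(2, 8))
cross_mask = pixel_region[:, None] == token_region[None, :]
cross_out, cross_w = modulated_attention(x, text, text, cross_mask)
```

With `strength=0.0` the function reduces to ordinary scaled dot-product attention, so the bias can be read as a dial between unmodified attention and strictly region-confined attention.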
