VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
TL;DR
VideoGrain addresses zero-shot multi-grained video editing at the class, instance, and part levels. It introduces Spatial-Temporal Layout-Guided Attention (ST-Layout Attn), which modulates both cross- and self-attention in diffusion-based video editing. Cross-attention is biased toward region-specific prompts to strengthen text-to-region control, and self-attention is modulated to keep the features of different regions separate, which reduces inter-region interference. The method requires no additional training and supports ControlNet conditioning. It is reported to achieve state-of-the-art results on real videos across multiple editing granularities, including scenes with occlusion and edits that must preserve the background. In practice, VideoGrain offers flexible, high-fidelity editing with improved temporal consistency and regional specificity for video retouching and content creation; the authors also acknowledge ethical considerations and potential misuse.
Abstract
Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
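To make the attention modulation described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each pixel (flattened over space, and possibly over frames) and each text token carries an integer region id from some layout. A constant bias is added to the attention logits for matching query-key pairs and subtracted for mismatched ones. The function names, the `strength` parameter, and the constant-bias form are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def modulated_cross_attention(q, k, token_region, pixel_region, strength=5.0):
    """Cross-attention weights with a layout bias (illustrative assumption).

    q: (P, d) pixel queries; k: (T, d) text-token keys.
    pixel_region: (P,) region id per pixel; token_region: (T,) region id per token.
    Pairs in the same region get +strength on the logit, others get -strength.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                           # (P, T)
    same = pixel_region[:, None] == token_region[None, :]   # (P, T) boolean
    bias = np.where(same, strength, -strength)
    return softmax(logits + bias, axis=-1)


def modulated_self_attention(q, k, pixel_region, strength=5.0):
    """Self-attention weights that favor intra-region pairs (illustrative assumption).

    q, k: (P, d) pixel queries and keys; pixel_region: (P,) region id per pixel.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                           # (P, P)
    same = pixel_region[:, None] == pixel_region[None, :]
    bias = np.where(same, strength, -strength)
    return softmax(logits + bias, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 4))            # 6 pixels, feature dimension 4
    k_text = rng.normal(size=(3, 4))       # 3 text tokens
    pixel_region = np.array([0, 0, 1, 1, 2, 2])
    token_region = np.array([0, 1, 2])     # one token per region

    cross = modulated_cross_attention(q, k_text, token_region, pixel_region)
    self_attn = modulated_self_attention(q, q, pixel_region)
    print(cross.round(3))
    print(self_attn.round(3))
```

Because the bias is added before the softmax, each row still sums to one. A positive `strength` strictly increases the share of attention each pixel gives to tokens (or pixels) in its own region compared with the unbiased case.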
