Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models
Saman Motamed, Wouter Van Gansbeke, Luc Van Gool
TL;DR
This work examines cross-attention as a control mechanism for zero-shot editing in Text-to-Video (T2V) diffusion models. It compares two strategies: forward guidance, which swaps cross-attention maps between prompts, and backward guidance, which uses an energy-based objective on the attention maps to update the latent. Forward guidance suffers from size and overlap artifacts, while backward guidance shows promise for editing object size and motion. The study identifies limitations of current T2V models, particularly noisy cross-attention maps, and demonstrates that backward guidance can enable editing without additional training data. Overall, the findings suggest that cross-attention-based editing is a viable path toward practical zero-shot video editing, motivating improvements in attention quality and video fidelity in future models.
Abstract
With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
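
To make the two guidance strategies named in the TL;DR concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a single text token attending over N flattened spatial positions, with the softmax taken over positions, and a hypothetical energy of the form (1 - attention mass inside a target region)^2. All function names are illustrative. A real T2V model would compute attention inside its denoising network and obtain the latent gradient by automatic differentiation; here the gradient is written out by hand for the toy case.

```python
import math


def attention_map(latent, token):
    """Toy attention of one text token over N spatial positions.

    Assumption: softmax is taken over positions, so the map sums to 1.
    latent: list of N feature vectors, token: one embedding vector.
    """
    scale = math.sqrt(len(token))
    scores = [sum(z * k for z, k in zip(row, token)) / scale for row in latent]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def energy(latent, token, mask):
    """Hypothetical energy: low when attention mass lies inside the mask."""
    attn = attention_map(latent, token)
    inside = sum(m * a for m, a in zip(mask, attn))
    return (1.0 - inside) ** 2


def backward_guidance_step(latent, token, mask, step_size=0.1):
    """One gradient-descent step on the latent to reduce the energy.

    The gradient of the energy with respect to each position's score is
    -2 * (1 - inside) * a_j * (m_j - inside); the chain rule through the
    dot product then gives the gradient with respect to the latent.
    """
    attn = attention_map(latent, token)
    inside = sum(m * a for m, a in zip(mask, attn))
    scale = math.sqrt(len(token))
    updated = []
    for row, a, m in zip(latent, attn, mask):
        grad_score = -2.0 * (1.0 - inside) * a * (m - inside)
        updated.append(
            [z - step_size * grad_score * k / scale for z, k in zip(row, token)]
        )
    return updated


def forward_guidance_swap(source_maps, edit_maps, token_index):
    """Forward guidance in toy form: overwrite one token's attention map
    in the edit pass with the map recorded from the source pass."""
    swapped = [list(m) for m in edit_maps]
    swapped[token_index] = list(source_maps[token_index])
    return swapped
```

In this sketch, repeatedly calling `backward_guidance_step` with a mask that marks the desired object region moves attention mass into that region, which is the intuition behind steering object size or position through the latent. `forward_guidance_swap` only copies maps and leaves the latent unchanged, which is consistent with the size and overlap artifacts the summary attributes to forward guidance.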
