Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models

Saman Motamed, Wouter Van Gansbeke, Luc Van Gool

TL;DR

This work addresses zero-shot editing in Text-to-Video (T2V) diffusion models by examining cross-attention as a control mechanism. It compares two approaches: forward guidance, which swaps cross-attention maps between prompts, and backward guidance, which uses an energy-based objective on the cross-attention maps to update the latent. Forward guidance suffers from artifacts when the swapped objects differ in size and shape, while backward guidance shows promise for editing object size and motion. The study identifies limitations of current T2V models, particularly noisy cross-attention maps, and demonstrates that backward guidance can enable editing without additional training data. Overall, the findings suggest that cross-attention-based editing is a viable path toward practical zero-shot video editing, motivating improvements in attention quality and video fidelity in future models.
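
As a rough illustration of the forward-guidance idea summarized above, the sketch below swaps the cross-attention map of a single token between two prompts. The array layout `(frames, pixels, tokens)`, the function name `swap_token_attention`, the random stand-in tensors, and the token index are all assumptions made for this example; they are not details taken from the paper or from any particular T2V implementation.

```python
import numpy as np


def swap_token_attention(attn_edit, attn_source, token_idx):
    """Return a copy of attn_edit whose map for token_idx comes from attn_source.

    Both arrays are assumed to have shape (frames, pixels, tokens).
    """
    if attn_edit.shape != attn_source.shape:
        raise ValueError("attention tensors must have the same shape")
    swapped = attn_edit.copy()
    # Overwrite only the column belonging to the differing token.
    swapped[:, :, token_idx] = attn_source[:, :, token_idx]
    return swapped


rng = np.random.default_rng(0)
frames, pixels, tokens = 8, 16 * 16, 6
attn_truck = rng.random((frames, pixels, tokens))  # stand-in for the "truck" prompt
attn_car = rng.random((frames, pixels, tokens))    # stand-in for the "car" prompt
token_idx = 2  # hypothetical position of the token that differs between prompts

attn_guided = swap_token_attention(attn_car, attn_truck, token_idx)
```

Because the swapped map keeps the spatial extent of the source object, the edited object inherits a footprint that may not fit it, which is one way to read the size-related artifacts attributed to forward guidance.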

Abstract

With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.

Paper Structure (20 sections, 5 equations, 5 figures)

Figures (5)

  • Figure 1: This figure shows an overview of backward guidance in T2V models. On the left, we show the generated frames of the T2V model after t steps, given an initial input latent $z_t$ and the text prompt "A burger floats on the water". To edit the video and move the burger from the top-left of the screen to the bottom-left in a straight line, we generate $\mathcal{A}_{tar_{f_i}}$ for each frame $f_i$ reflecting this edit. Following the scheme in Section \ref{exp:details}, we update the latent through the denoising process based on objective $E$. At time step 0, $z_0$ generates the video on the right which reflects the intended edit.
  • Figure 2: We show an example of forward guidance by swapping the cross-attention maps of "car" with cross-attention maps of the "truck". The two input texts only differ in one token ("truck" and "car"). While the car follows the motion and location of the truck in the video, artifacts can be seen around the car due to the mismatch in size and shape of the truck and car.
  • Figure 3: We compare the cross-attention maps of a T2I model and a T2V model for the same prompt. The cross-attention maps are extracted and averaged at the $16 \times 16$ resolution from the mid-blocks and up-blocks of the U-Net. Open-source T2I models currently produce much less noisy cross-attention maps than T2V models. In Section \ref{label:detailxatt}, we give details on how the noisy cross-attentions hinder backward guidance and propose a procedure for bypassing this limitation for our experiments in this paper.
  • Figure 4: We show qualitative results for shrinking and enlarging objects through backward guidance. The middle image of each row visualizes the first frame of the original video. We enlarge and shrink the target cross-attentions at four different levels (big / bigger and small / smaller) and update the latent through backward guidance. The first frame of each edited video is shown.
  • Figure 5: The figure visualizes the results of backward cross-attention guidance. For each of the 4 examples, we show the output of the T2V model given the prompt in black. The blue text describes the applied transformation to the cross-attentions at each frame. We update the input latent accordingly. The red bounding box highlights the edit's success.
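
The per-frame target cross-attentions described for Figure 1 can be sketched as box masks on a $16 \times 16$ grid that move in a straight line across frames. This is a minimal sketch under stated assumptions: the helper names (`box_mask`, `linear_motion_targets`, `energy`), the box coordinates, and the frame count are hypothetical, and the energy shown is an assumed form that penalizes attention mass outside the target region. The paper's actual objective $E$ is not reproduced in this section.

```python
import numpy as np


def box_mask(size, top, left, height, width):
    """Binary size x size mask with ones inside the given box."""
    mask = np.zeros((size, size))
    mask[top:top + height, left:left + width] = 1.0
    return mask


def linear_motion_targets(num_frames, size, start, end, box):
    """One target mask per frame, moving the box from start to end.

    start and end are (top, left) corners; box is (height, width).
    """
    targets = []
    for i in range(num_frames):
        alpha = i / max(num_frames - 1, 1)  # 0 at the first frame, 1 at the last
        top = int(round((1 - alpha) * start[0] + alpha * end[0]))
        left = int(round((1 - alpha) * start[1] + alpha * end[1]))
        targets.append(box_mask(size, top, left, *box))
    return np.stack(targets)


def energy(attn, target):
    """Assumed energy: squared fraction of attention mass outside the target,
    averaged over frames. Not the paper's definition of E."""
    inside = (attn * target).sum(axis=(1, 2))
    total = attn.sum(axis=(1, 2)) + 1e-8
    return float(np.mean((1.0 - inside / total) ** 2))


# Move a 4 x 4 box from the top-left to the bottom-left over 8 frames.
targets = linear_motion_targets(
    num_frames=8, size=16, start=(1, 1), end=(11, 1), box=(4, 4)
)
```

In an actual backward-guidance loop, the gradient of such an energy with respect to the latent would be obtained by automatic differentiation through the denoising network, which this sketch does not attempt. Resizing edits like those in Figure 4 would correspond to changing the `box` argument rather than the start and end positions.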