Re-Attentional Controllable Video Diffusion Editing

Yuanzhi Wang, Yong Li, Mengyi Liu, Xiaoya Zhang, Xin Liu, Zhen Cui, Antoni B. Chan

TL;DR

Re-Attentional Controllable Video Diffusion Editing (ReAtCo) tackles the challenge of fine-grained, spatially accurate text-guided video editing by introducing two training-free mechanisms: Re-Attentional Diffusion (RAD), which refocuses cross-attention maps to align edited objects with regions specified by text prompts, and Invariant Region-guided Joint Sampling (IRJS), which preserves background content and reduces border artifacts during denoising. By formalizing object changes and invariant regions, and integrating RAD and IRJS into a DDIM/Tune-A-Video-based framework, the method achieves improved spatial controllability, object-count adherence, and semantic fidelity across video frames. Extensive experiments on LOVEU-TGVE-2023 and other datasets demonstrate superior performance over state-of-the-art baselines, with notable gains in the VISOR spatial relationship metric and qualitative results showing well-aligned, harmonized edits. The work also provides practical guidelines for word selection, resource-friendly extensions, and a discussion of broader impact and limitations.

Abstract

Text-guided video editing has gained popularity because of its streamlined process: users only need to edit the text prompt corresponding to the source video. Recent studies have exploited large-scale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, these methods may still suffer from limitations such as mislocated objects and an incorrect number of objects, so the controllability of video editing remains a formidable challenge. In this paper, we address these limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specifically, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose Re-Attentional Diffusion (RAD), which refocuses the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video. Furthermore, to faithfully preserve the invariant region content with fewer border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy that mitigates the intrinsic sampling errors w.r.t. the invariant regions at each denoising timestep and constrains the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.
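
The idea of preserving an invariant region during denoising can be illustrated with a generic mask-guided blend. The sketch below is only an assumption about the general shape of such a step, not the paper's IRJS formulation, which the abstract does not spell out. The names `blend_invariant_region`, `edited`, `source`, and `edit_mask` are hypothetical, and a single latent frame is simplified to a 2D grid of floats.

```python
from typing import List

# Hypothetical simplification: one latent frame as a 2D grid of floats.
Grid = List[List[float]]


def blend_invariant_region(edited: Grid, source: Grid, edit_mask: Grid) -> Grid:
    """Generic mask-guided blend (an illustrative assumption, not IRJS itself).

    Where edit_mask is 1.0 the edited value is kept; where it is 0.0 the
    source value is restored, so the invariant region stays untouched.
    Fractional mask values give a soft transition at region borders.
    """
    if not (len(edited) == len(source) == len(edit_mask)):
        raise ValueError("edited, source and edit_mask must have the same height")

    blended: Grid = []
    for e_row, s_row, m_row in zip(edited, source, edit_mask):
        if not (len(e_row) == len(s_row) == len(m_row)):
            raise ValueError("edited, source and edit_mask must have the same width")
        # Per-position convex combination of edited and source values.
        blended.append([m * e + (1.0 - m) * s for e, s, m in zip(e_row, s_row, m_row)])
    return blended
```

In a sampler, a blend of this kind would be applied at each denoising timestep, with `source` taken from the source video's latent at the matching noise level; how ReAtCo obtains and corrects that latent is specified in the paper, not here.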

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Edited samples from a common video diffusion editing method (the classic Tune-A-Video (TAV) as an example) and our proposed ReAtCo.
  • Figure 2: The framework of our proposed ReAtCo. Given a source video $\mathcal{V}$, ReAtCo first uses DDIM Inversion for video-to-noise inversion, and the inverted noise is then gradually denoised into an edited video $\mathcal{V}^{\text{edit}}$ by a video diffusion editing model. During the denoising stage, ReAtCo injects the proposed Re-Attentional Diffusion (RAD) and the user-specified regions of interest (i.e., the regions of the two dolphins, $\mathcal{M}_1$ and $\mathcal{M}_2$) into the video diffusion editing model to refocus the cross-attention maps (e.g., $\mathcal{A}^2(t)$ and $\mathcal{A}^5(t)$ for word indices $2$ and $5$ at timestep $t$) between the words of interest ("jellyfish" and "goldfish") and the noisy video (e.g., $\mathcal{X}(t)$ at timestep $t$), thereby controlling the spatial location of the edited objects. In addition, we design an Invariant Region-guided Joint Sampling (IRJS) strategy to preserve the invariant region with fewer border artifacts.
  • Figure 3: Edited video frames by different methods.
  • Figure 4: The framework of our proposed IRJS.
  • Figure 5: Visual comparisons of different methods in various scenes. Compared with these state-of-the-art methods, ReAtCo can edit real-world videos with spatial location alignment, a consistent number of objects, and high semantic fidelity.
  • ...and 3 more figures
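
The Figure 2 caption describes refocusing a word's cross-attention map onto a user-specified region. One simple way to quantify how focused a map is would be the fraction of attention mass that falls inside the region. The sketch below assumes that measure and a squared penalty on the remainder; both are illustrative choices, not the objective defined in the paper. The names `attention_mass_in_region` and `refocus_loss` are hypothetical, and the attention map for one word at one timestep is simplified to a 2D grid.

```python
from typing import List

# Hypothetical simplification: a cross-attention map for one word and one
# timestep, and a binary region mask, both as 2D grids of floats.
Grid = List[List[float]]


def attention_mass_in_region(attn: Grid, region_mask: Grid) -> float:
    """Fraction of total attention that lies inside the masked region."""
    total = sum(sum(row) for row in attn)
    if total <= 0.0:
        return 0.0  # no attention anywhere, so nothing lies inside the region
    inside = sum(
        a * m
        for a_row, m_row in zip(attn, region_mask)
        for a, m in zip(a_row, m_row)
    )
    return inside / total


def refocus_loss(attn: Grid, region_mask: Grid) -> float:
    """Illustrative penalty: zero when all attention is inside the region."""
    return (1.0 - attention_mass_in_region(attn, region_mask)) ** 2
```

In a training-free setting, a penalty of this kind would typically be minimized with respect to the noisy latent rather than the model weights; this sketch only covers the measurement, not that update.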