Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior

Tanvir Mahmud, Mustafa Munir, Radu Marculescu, Diana Marculescu

TL;DR

This work proposes an adaptive motion-guided cross-frame attention mechanism that selectively reduces redundant computations and achieves a threefold increase in the number of keyframes processed compared to existing methods, all within the same computational budget as fully cross-frame attention baselines.

Abstract

Video-to-video synthesis poses significant challenges in maintaining character consistency, smooth temporal transitions, and preserving visual quality during fast motion. While recent fully cross-frame self-attention mechanisms have improved character consistency across multiple frames, they come with high computational costs and often include redundant operations, especially for videos with higher frame rates. To address these inefficiencies, we propose an adaptive motion-guided cross-frame attention mechanism that selectively reduces redundant computations. This enables a greater number of cross-frame attentions over more frames within the same computational budget, thereby enhancing both video quality and temporal coherence. Our method leverages optical flow to focus on moving regions while sparsely attending to stationary areas, allowing for the joint editing of more frames without increasing computational demands. Traditional frame interpolation techniques struggle with motion blur and flickering in intermediate frames, which compromises visual fidelity. To mitigate this, we introduce KV-caching for jointly edited frames, reusing keys and values across intermediate frames to preserve visual quality and maintain temporal consistency throughout the video. With our adaptive cross-frame self-attention approach, we achieve a threefold increase in the number of keyframes processed compared to existing methods, all within the same computational budget as fully cross-frame attention baselines. This results in significant improvements in prediction accuracy and temporal consistency, outperforming state-of-the-art approaches. Code will be made publicly available at https://github.com/tanvir-utexas/AdaVE/tree/main
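
The sparse cross-frame attention and KV-caching ideas described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sparse_kv` and `attention`, the `static_stride` subsampling rule for stationary tokens, and the random toy tensors are all assumptions made for illustration. The sketch keeps every key/value token inside a moving region, keeps only a strided subset of stationary tokens, caches the result, and lets an intermediate frame attend to that cache.

```python
import numpy as np

def sparse_kv(keys, values, motion_masks, static_stride=4):
    """Build sparsely extended keys/values across frames (hypothetical helper).

    keys, values: arrays of shape (frames, tokens, dim).
    motion_masks: boolean array of shape (frames, tokens), True for moving tokens.
    Moving tokens are all kept; stationary tokens are kept every
    `static_stride`-th position (an assumed subsampling rule).
    """
    kept_k, kept_v = [], []
    for k, v, m in zip(keys, values, motion_masks):
        moving_idx = np.flatnonzero(m)
        static_idx = np.flatnonzero(~m)[::static_stride]
        keep = np.sort(np.concatenate([moving_idx, static_idx]))
        kept_k.append(k[keep])
        kept_v.append(v[keep])
    return np.concatenate(kept_k), np.concatenate(kept_v)

def attention(q, k, v):
    """Plain scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
num_frames, num_tokens, dim = 3, 16, 8
keys = rng.normal(size=(num_frames, num_tokens, dim))
values = rng.normal(size=(num_frames, num_tokens, dim))

# Toy motion masks: the first 4 tokens of every reference frame are "moving".
masks = np.zeros((num_frames, num_tokens), dtype=bool)
masks[:, :4] = True

# Reference frames: build the sparse KVs once and cache them.
k_sparse, v_sparse = sparse_kv(keys, values, masks, static_stride=4)
kv_cache = {"k": k_sparse, "v": v_sparse}

# Intermediate frame: reuse the cached KVs instead of recomputing them.
q_intermediate = rng.normal(size=(num_tokens, dim))
edited = attention(q_intermediate, kv_cache["k"], kv_cache["v"])

full_kv_tokens = num_frames * num_tokens
print(k_sparse.shape[0], "sparse KV tokens vs", full_kv_tokens, "for full extension")
```

In this toy setup each frame contributes 4 moving tokens plus 3 of its 12 stationary tokens, so the attention cost per query scales with 21 key/value tokens rather than 48. Savings of this kind are what would allow more frames to be edited jointly under a fixed budget; the actual ratio depends on how much of each frame is moving and on the sampling rule used.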

Paper Structure (21 sections, 6 equations, 6 figures, 3 tables)

This paper contains 21 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Effect of self-attention extension on diverse motion: Frames are sampled at intervals of 1 (slow motion) and 10 (fast motion). Two methods are compared: one uses the first frame's key-value (KV) across all frames for efficiency, while the other fully extends KVs, which is computationally intensive. The efficient method works well in slow motion but struggles with faster motion, where full extension achieves better results. This highlights the need for adaptive self-attention based on motion to enhance video quality, reduce redundant computations, and incorporate more frames in the self-attention process.
  • Figure 2: Ada-VE Overview: (i) During preprocessing, DDIM inversion is performed to extract deterministic noise $\mathbf{X}_T \sim \mathcal{N}(0, I)$, and successive coarse motion masks $\mathcal{M}$ are extracted using a lightweight optical flow model, in a single step. (ii) Several reference frames $\mathbf{X}_{t, ref}$ are then sampled at timestep $t$, and jointly edited iteratively with the proposed sparse extension of self-attention KVs guided by motion masks $\mathcal{M}$, with all extended KVs being cached. (iii) Finally, all intermediate frames $\mathbf{X}_{t, int}$ are edited using the cached sparse reference KVs at timestep $t$.
  • Figure 3: Motion Mask Extraction: A lightweight, off-the-shelf model is employed to extract coarse optical flow maps from each pair of successive frames. These flow maps are converted to RGB and then to grayscale images. Finally, a thresholding technique is applied to extract moving region masks for each frame. This operation is performed once during preprocessing in a single step.
  • Figure 4: (i) Basic Self-Attention: Queries (Q), keys (K), and values (V) are independently used for each frame. (ii) Fully Extended Self-Attention: All keys and values are combined into $K_{all}$ and $V_{all}$ for cross-frame self-attention. (iii) Proposed Sparsely Extended Self-Attention: Keys and values from all frames are sparsely extended into $K_{sparse}$ and $V_{sparse}$ to capture more details of moving regions than stationary background regions, utilizing motion masks $\mathcal{M}$.
  • Figure 5: Qualitative comparison with state-of-the-art video editing methods. SDEdit [meng2022sdedit] struggles with motion blur and inconsistent character generation in longer videos, while ControlVideo [zhang2023controlvideo] also has inconsistency issues. TokenFlow [geyer2023tokenflow] offers better temporal consistency but still suffers from motion blur. In contrast, Ada-VE (ours) achieves superior visual quality and consistency throughout; like TokenFlow, it builds on the PnP [pnpDiffusion2023] image editing baseline, but it delivers higher-quality results.
  • ...and 1 more figures
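
The thresholding step of the motion mask extraction in Figure 3 can be sketched as follows. This is an illustrative simplification, not the pipeline from the paper: the sketch thresholds the normalized flow magnitude directly instead of converting the flow map to RGB and then to grayscale, the optical flow itself is a synthetic array rather than the output of a flow model, and the function name `motion_mask` and the default `threshold` value are assumptions.

```python
import numpy as np

def motion_mask(flow, threshold=0.5):
    """Turn a dense flow field into a boolean moving-region mask (hypothetical helper).

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacement
          between two successive frames.
    threshold: cut-off on the flow magnitude after normalizing by its peak
               (an assumed normalization; the source does not specify one).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    peak = magnitude.max()
    if peak == 0:
        # No motion anywhere: every pixel is stationary.
        return np.zeros(magnitude.shape, dtype=bool)
    return (magnitude / peak) > threshold

# Synthetic flow: a 3x3 patch moves strongly, one pixel moves slightly.
flow = np.zeros((8, 8, 2))
flow[2:5, 3:6] = [3.0, 4.0]   # magnitude 5.0
flow[0, 0] = [0.3, 0.4]       # magnitude 0.5, below the cut-off after normalization

mask = motion_mask(flow)
print(int(mask.sum()), "moving pixels out of", mask.size)
```

Because the mask depends only on the input frames, it can be computed once per pair of successive frames during preprocessing and reused at every denoising timestep, which matches the single-step extraction described in the caption.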