VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

Jing Gu, Yuwei Fang, Ivan Skorokhodov, Peter Wonka, Xinya Du, Sergey Tulyakov, Xin Eric Wang

TL;DR

VIA tackles the problem of text-driven video editing that maintains both local frame fidelity and global temporal coherence over long videos. It introduces two complementary mechanisms: test-time editing adaptation with local latent adaptation for precise, instruction-following edits within frames, and spatiotemporal adaptation using a gather-and-swap attention strategy to preserve global consistency across the sequence. Empirical results show VIA yields more faithful, temporally coherent edits with faster processing than baselines, enabling minute-long video editing in practical timeframes. The approach offers a generalizable framework for unified local-global video adaptation with broad implications for media production, education, and entertainment, while highlighting considerations for ethics and safety in advanced video editing.

Abstract

Video editing serves as a fundamental pillar of digital media, spanning applications in entertainment, education, and professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, we design test-time editing adaptation, which adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and which adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation, which recursively gathers consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.
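
The abstract mentions adapting masked latent variables for precise local control but does not give the update rule here. As a rough illustration only, the sketch below shows one common way such masking can work: keep the edited latent inside the target region and the source latent everywhere else. The function name `blend_latents`, the tensor shapes, and the linear blend itself are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def blend_latents(source: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep edited latents where mask == 1 and source latents where mask == 0.

    Assumed shapes: source and edited are (C, H, W); mask is (H, W) and is
    broadcast over the channel axis. This is an illustrative blend, not the
    rule used in the paper.
    """
    mask = mask.astype(source.dtype)
    return mask * edited + (1.0 - mask) * source


# Toy example: only the top-left latent position is marked as editable.
source = np.zeros((4, 2, 2))
edited = np.ones((4, 2, 2))
mask = np.array([[1, 0], [0, 0]])
blended = blend_latents(source, edited, mask)
```

Positions outside the mask are copied from the source unchanged, which matches the stated goal of leaving non-target regions of the input video untouched.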

Paper Structure (24 sections, 12 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Video editing results by VIA. VIA excels in precise and consistent editing across diverse video tasks. Top: consistent results over long videos with a duration of 1 minute, which is challenging in current literature. Bottom: consistent results for precise local editing.
  • Figure 2: Overview of the VIA framework. For local consistency, Test-time Editing Adaptation finetunes the editing model with augmented editing pairs to ensure consistent editing directions with the text instruction, and Local Latent Adaptation achieves precise editing control and preserves non-target pixels from the input video. For global consistency, Spatiotemporal Adaptation collects and applies key attention variables across all frames.
  • Figure 3: The gather-and-swap process for video editing. The left part of the diagram illustrates the gathering process. We initially sample $k+1$ frames evenly distributed throughout the video. The first frame undergoes standard editing using an image editing model, during which the attention variables are captured and stored. For each of the subsequent $k$ frames, the attention variable from the preceding frame is swapped in, and its own attention variables are also preserved. In the right part, the collected attention variables from all $k+1$ frames are swapped into the editing process of each frame. This includes applying the previously gathered attention variables to enhance the consistency and quality of edits across the sequence.
  • Figure 4: Local editing results. VIA is capable of performing a wide range of localized editing tasks, where only specific regions or pixels within a frame are modified. The video length is given in the text below the video frames.
  • Figure 5: Global editing results. VIA demonstrates robust global editing performance across various videos using a consistent set of editing instructions, producing high-quality results. The videos are 2 minutes, 1 minute, 30 seconds, and 7 seconds long.
  • ...and 11 more figures
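
The gather-and-swap process described in the Figure 3 caption can be sketched as a toy control flow. Everything model-specific below is a placeholder: `toy_edit` is a hypothetical stand-in for one pass of an image editing model, the "attention variable" is just a per-column mean of the frame, and modelling a swap as a 50/50 average is an assumption made so the example runs. Only the two-stage structure (gather over $k+1$ evenly spaced key frames, then swap the gathered bank into every frame) follows the caption.

```python
import numpy as np


def sample_key_frames(num_frames: int, k: int) -> list[int]:
    """Return k + 1 frame indices spread evenly over the video."""
    return [int(i) for i in np.linspace(0, num_frames - 1, k + 1).round()]


def toy_edit(frame: np.ndarray, swapped_in: list[np.ndarray]):
    """Hypothetical stand-in for one pass of an image editing model.

    Returns (edited_frame, attention_variable). The attention variable is a
    per-column mean here, not a real attention tensor.
    """
    own = frame.mean(axis=0)
    if swapped_in:
        # Assumption: a swap is modelled as averaging with the swapped-in bank.
        own = 0.5 * own + 0.5 * np.mean(swapped_in, axis=0)
    return frame + own, own


def gather(frames: list[np.ndarray], key_idx: list[int]) -> list[np.ndarray]:
    """Gather stage: edit key frames in order, storing each one's variable."""
    bank: list[np.ndarray] = []
    for i in key_idx:
        previous = bank[-1:]  # empty for the first key frame (standard edit)
        _, own = toy_edit(frames[i], previous)
        bank.append(own)
    return bank


def swap(frames: list[np.ndarray], bank: list[np.ndarray]) -> list[np.ndarray]:
    """Swap stage: edit every frame with the whole gathered bank swapped in."""
    return [toy_edit(frame, bank)[0] for frame in frames]


# Toy video: 9 constant frames of shape (2, 3), with k = 2 (three key frames).
frames = [np.full((2, 3), float(t)) for t in range(9)]
key_idx = sample_key_frames(len(frames), k=2)
bank = gather(frames, key_idx)
edited = swap(frames, bank)
```

Because every frame in the swap stage sees the same bank, all frames are edited against shared context, which is the intuition behind the consistency claim; a real system would swap attention tensors inside the editing network rather than averaging frame statistics.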