ReVideo: Remake a Video with Motion and Content Control

Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang

TL;DR

ReVideo addresses the challenge of precise local video editing with simultaneous content and motion control. It introduces a three-stage training strategy and a spatiotemporal adaptive fusion module to decouple and fuse content and motion signals within a diffusion-based video generation framework built on Stable Video Diffusion. The approach enables local content changes, custom motion trajectories, and multi-area editing. Extensive experiments, including quantitative and human evaluations, show improvements over baselines. The work highlights the importance of addressing condition coupling and training imbalance for practical, controllable video editing.

Abstract

Despite significant advancements in video generation and editing using diffusion models, achieving accurate and localized video editing remains a substantial challenge. Additionally, most existing video editing methods primarily focus on altering visual content, with limited research dedicated to motion editing. In this paper, we present a novel attempt to Remake a Video (ReVideo) which stands out from existing methods by allowing precise video editing in specific areas through the specification of both content and motion. Content editing is facilitated by modifying the first frame, while the trajectory-based motion control offers an intuitive user interaction experience. ReVideo addresses a new task involving the coupling and training imbalance between content and motion control. To tackle this, we develop a three-stage training strategy that progressively decouples these two aspects from coarse to fine. Furthermore, we propose a spatiotemporal adaptive fusion module to integrate content and motion control across various sampling steps and spatial locations. Extensive experiments demonstrate that our ReVideo has promising performance on several accurate video editing applications, i.e., (1) locally changing video content while keeping the motion constant, (2) keeping content unchanged and customizing new motion trajectories, (3) modifying both content and motion trajectories. Our method can also seamlessly extend these applications to multi-area editing without specific training, demonstrating its flexibility and robustness.
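
To make the idea of fusing content and motion control "across various sampling steps and spatial locations" more concrete, here is a minimal NumPy sketch. It is not the authors' module: the function name `adaptive_fusion`, the feature shapes, the convex-combination form, and the hand-picked scalars `w_mask`, `w_time`, and `bias` (which stand in for learned parameters) are all assumptions made for illustration. It only shows how a fusion weight can depend on both the spatial position (via an editing mask) and the sampling timestep.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_fusion(content_feat, motion_feat, edit_mask, timestep, num_steps,
                    w_mask=4.0, w_time=2.0, bias=-2.0):
    """Blend two feature maps with a weight that varies over space and timestep.

    content_feat, motion_feat: arrays of shape (T, H, W, C).
    edit_mask: array of shape (H, W), 1 inside the editing area, 0 outside.
    timestep, num_steps: current sampling step and total number of steps.
    w_mask, w_time, bias: hypothetical scalars standing in for learned weights.
    """
    t_norm = timestep / float(num_steps)
    # Logits depend on location (mask) and on the sampling step (t_norm).
    logits = w_mask * edit_mask[None, :, :, None] + w_time * t_norm + bias
    gamma = sigmoid(logits)  # shape (1, H, W, 1), broadcast over frames and channels
    # Assumed form: convex combination of the two control branches.
    fused = gamma * motion_feat + (1.0 - gamma) * content_feat
    return fused, gamma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(2, 4, 4, 3))
    motion = rng.normal(size=(2, 4, 4, 3))
    mask = np.zeros((4, 4))
    mask[1:3, 1:3] = 1.0
    fused, gamma = adaptive_fusion(content, motion, mask, timestep=10, num_steps=25)
    print(fused.shape, gamma.shape)
```

In this sketch the weight is larger inside the editing mask than outside it and shifts with the timestep; in a trained model the mapping from mask and timestep to the weight would be learned rather than fixed.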

Paper Structure (22 sections, 5 equations, 12 figures, 1 table)

Figures (12)

  • Figure 1: Two potential structures to inject motion and content control.
  • Figure 2: The motion control capability of the two structures in Fig. 1 with different training strategies. We visualize trajectory lines in a specific area (red box) and label the editing area with a black box. The toy experiments expose the coupling between customized motion and unedited content.
  • Figure 3: The data construction strategy for decoupling training and editing results from this stage.
  • Figure 4: The architecture of our proposed spatiotemporal adaptive fusion module (left), and the visualization of fusion weight $\mathbf{\Gamma}$ at different timesteps (right).
  • Figure 5: The visual comparison between InsV2V, AnyV2V, Pika, and our ReVideo.
  • ...and 7 more figures