Motion-Conditioned Image Animation for Video Editing

Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

TL;DR

MoCA introduces a simple yet strong framework for video editing that decouples image editing from motion-aware animation and uses optical-flow conditioning to preserve the original motion. The model is trained as a latent diffusion model conditioned on text, the edited first frame, and motion, with the option to drop motion conditioning for motion edits. A new 271-task benchmark spanning style, background, object, and motion edits demonstrates MoCA's broad capabilities, and human raters prefer its results over strong baselines. Automatic metrics based on VideoCLIP correlate with human judgments, but better evaluation methods for video editing are still needed.

Abstract

We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods along with MoCA, on our proposed benchmark. MoCA establishes a new state-of-the-art, demonstrating greater human preference win-rate, and outperforming notable recent approaches including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits.
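
The two-stage decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Conditioning`, `build_conditioning`, `edit_image`, `estimate_flow`, and `is_motion_edit` are hypothetical names, frames and flow fields are toy nested lists, and the diffusion sampler itself is omitted. The sketch only shows which inputs the animation model would be sampled with, and that the flow input is dropped for motion edits.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = List[List[float]]  # toy stand-in for an image
Flow = List[List[float]]   # toy stand-in for one optical-flow field


@dataclass
class Conditioning:
    """Inputs for sampling the animation model (hypothetical container)."""
    prompt: str
    first_frame: Frame
    flow: Optional[Sequence[Flow]]  # None means motion conditioning is dropped


def build_conditioning(
    source_video: Sequence[Frame],
    edit_prompt: str,
    edit_image: Callable[[Frame, str], Frame],
    estimate_flow: Callable[[Sequence[Frame]], Sequence[Flow]],
    is_motion_edit: bool,
) -> Conditioning:
    # Stage 1: edit only the first frame with any image-editing method.
    edited_first = edit_image(source_video[0], edit_prompt)
    # Stage 2 input: keep the source motion unless the edit targets the motion.
    flow = None if is_motion_edit else estimate_flow(source_video)
    return Conditioning(edit_prompt, edited_first, flow)
```

Passing the image editor and the flow estimator as callables keeps the sketch independent of any particular editing or optical-flow method; the source does not tie the decomposition to a specific one here.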

Paper Structure (19 sections, 4 equations, 11 figures, 7 tables)

This paper contains 19 sections, 4 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: MoCA is able to generate a diverse range of edits, such as object replacement, style changes, and motion edits. In each example, the top row shows frames from the source video and the bottom row shows the frames edited by MoCA. The source and editing prompts are shown above each example.
  • Figure 2: An overview of MoCA. Given a source video, we compute its optical flow and apply image editing techniques to the first frame. To produce the resulting video edit, we sample our model conditioned on motion, the edited first frame, and the edit caption. For motion-based edits, we drop the optical flow conditioning.
  • Figure 3: Comparison of our method against baselines for a given video editing task. Our method is able to accurately edit both the spatial and temporal properties of the source video.
  • Figure 4: Percentage of each reason selected when human evaluators prefer MoCA edits to each of the baselines. The reasons for picking one model over another on each video edit could be either its better alignment with the edit prompt, higher consistency with the source video, or both. Generally, human raters preferred our method in terms of better alignment with the desired edit prompt.
  • Figure 5: MoCA edits for "A boat sailing on the moon" with and without motion conditioning. Using motion conditioning allows the model to more faithfully follow the boat's movement in the original source video. Without motion conditioning, the model tends to generate more random movement directions, such as moving backwards.
  • ...and 6 more figures