Motion-Conditioned Image Animation for Video Editing
Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi
TL;DR
MoCA is a simple yet strong framework for video editing that decouples image editing from motion-aware animation and uses optical-flow conditioning to preserve the original motion. The model is trained as a latent diffusion model conditioned on text, the edited first frame, and motion; the motion conditioning can be dropped when the edit itself changes the motion. A new 271-task benchmark spanning style, background, object, and motion edits demonstrates MoCA's broad capabilities, with human raters preferring its results over strong baselines. Automatic metrics based on VideoCLIP correlate with human judgments, but the results underscore the ongoing need for better evaluation methods in video editing.
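The decomposition described above can be sketched as a two-stage pipeline. This is a minimal illustration, not the authors' implementation: `edit_image`, `estimate_flow`, and `animate` are hypothetical callables standing in for an image editor, an optical-flow estimator, and the motion-conditioned animation model, and the `preserve_motion` flag is an assumed name for the option to drop motion conditioning.

```python
from typing import Any, Callable, List, Optional, Sequence


def edit_video(
    frames: Sequence[Any],
    prompt: str,
    edit_image: Callable[[Any, str], Any],
    estimate_flow: Callable[[Any, Any], Any],
    animate: Callable[[Any, str, Optional[List[Any]], int], List[Any]],
    preserve_motion: bool = True,
) -> List[Any]:
    """Edit a video by editing its first frame, then animating that frame.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    if not frames:
        raise ValueError("frames must contain at least one frame")

    # Stage 1: image editing, applied to the first frame only.
    edited_first = edit_image(frames[0], prompt)

    # Motion signal: flow between consecutive source frames. It is skipped
    # when the edit is meant to change the motion itself.
    flows: Optional[List[Any]] = None
    if preserve_motion:
        flows = [estimate_flow(a, b) for a, b in zip(frames, frames[1:])]

    # Stage 2: animation conditioned on the text prompt, the edited first
    # frame, and (optionally) the motion signal.
    return animate(edited_first, prompt, flows, len(frames))
```

Passing `preserve_motion=False` hands `None` to the animator in place of the flow fields, which mirrors the idea of dropping motion conditioning for motion edits.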
Abstract
We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods, along with MoCA, on our proposed benchmark. MoCA establishes a new state of the art, achieving a higher human-preference win rate and outperforming notable recent approaches, including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits.
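For readers unfamiliar with the metric, a pairwise human-preference win rate can be computed as below. This is a sketch under the assumption that each rating is a single choice between two methods; the abstract does not specify how ties or multiple raters per task are aggregated, so the `win_rate` helper and its input format are illustrative only.

```python
from typing import Sequence


def win_rate(preferences: Sequence[str], method: str = "MoCA") -> float:
    """Fraction of pairwise comparisons in which `method` was preferred.

    `preferences` holds one entry per comparison: the name of the method
    the rater chose. This input format is an assumption for illustration.
    """
    if not preferences:
        raise ValueError("preferences must not be empty")
    wins = sum(1 for choice in preferences if choice == method)
    return wins / len(preferences)


# Toy data: MoCA is preferred in 3 of 4 comparisons against a baseline.
votes = ["MoCA", "MoCA", "Baseline", "MoCA"]
print(f"{win_rate(votes):.0%}")
```

A value above 50% means the method was preferred more often than the baseline it was compared against.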
