MotionV2V: Editing Motion in a Video
Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
TL;DR
MotionV2V edits motion in existing videos by directly manipulating sparse trajectories, and defines a motion edit as the delta between the input and edited trajectories. The authors generate motion counterfactual video pairs to supervise a motion-conditioned diffusion backbone, and design a three-stream conditioning scheme (counterfactual video, counterfactual motion tracks, and target motion tracks) that feeds a pre-trained diffusion model through a ControlNet-like adapter. The method supports object and camera motion edits, temporal retiming, and edits across arbitrary frames; it is strongly preferred in user studies and shows quantitative gains over first-frame I2V baselines. It preserves content robustly and supports flexible, iterative motion edits, which makes it practical for editing complex scenes without manual masking. The authors also outline future work on better ground-truth data and larger synthetic datasets.
Abstract
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
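To make the notion of a "motion edit" concrete, the following is a minimal sketch of the deviation between input and edited sparse trajectories. It is an illustration only, not the authors' implementation: the track layout (one (x, y) position per frame), the function names `motion_edit` and `shift_from_frame`, and the example coordinates are all assumptions made here.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Track = List[Point]  # assumed layout: one (x, y) position per frame


def motion_edit(input_tracks: List[Track], edited_tracks: List[Track]) -> List[Track]:
    """Per-track, per-frame displacement (edited minus input)."""
    if len(input_tracks) != len(edited_tracks):
        raise ValueError("track counts differ")
    deltas = []
    for src, dst in zip(input_tracks, edited_tracks):
        if len(src) != len(dst):
            raise ValueError("frame counts differ")
        deltas.append([(xd - xs, yd - ys) for (xs, ys), (xd, yd) in zip(src, dst)])
    return deltas


def shift_from_frame(track: Track, start_frame: int, dx: float, dy: float) -> Track:
    """Hypothetical user edit: translate a track by (dx, dy) from start_frame onward."""
    return [
        (x + dx, y + dy) if t >= start_frame else (x, y)
        for t, (x, y) in enumerate(track)
    ]


# Two made-up tracks over four frames: one moving right, one static.
input_tracks = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [(5.0, 5.0), (5.0, 5.0), (5.0, 5.0), (5.0, 5.0)],
]

# Edit only the first track, starting at frame 2 rather than the first frame.
edited_tracks = [
    shift_from_frame(input_tracks[0], start_frame=2, dx=0.0, dy=1.0),
    list(input_tracks[1]),
]

delta = motion_edit(input_tracks, edited_tracks)
```

In this sketch the delta is zero wherever the user left a track untouched and nonzero only from the frame where the edit begins. How the model turns such an edit into a video that propagates naturally is the job of the fine-tuned diffusion backbone and is not modeled here.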
