MotionV2V: Editing Motion in a Video

Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz

TL;DR

MotionV2V edits motion in existing videos by directly manipulating sparse trajectories, and defines a motion edit as the delta between the input and edited trajectories. The authors generate motion counterfactual video pairs to supervise a motion-conditioned diffusion backbone, and design a three-stream conditioning scheme (counterfactual video, counterfactual motion tracks, and target motion tracks) fed through a ControlNet-like adapter on a pre-trained diffusion model. The method supports object and camera motion edits, temporal retiming, and edits across arbitrary frames, achieving over 65 percent preference in a four-way head-to-head user study and quantitative gains over first-frame I2V baselines. The work demonstrates content preservation and flexible, iterative motion edits for complex scenes without manual masking, and outlines avenues for improved ground-truth data and larger synthetic datasets.

Abstract

While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
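
To make the notion of a "motion edit" concrete, the sketch below computes the per-frame deviation between a source track and a target track for a single tracked point. The data layout (a list of `(x, y)` pairs with `None` marking frames where the point is not visible) and the helper names `motion_edit` and `first_edited_frame` are assumptions made for illustration; the abstract does not specify how the authors store trajectories.

```python
from typing import List, Optional, Tuple

# Assumed layout: one entry per frame; None means the point is not visible.
Point = Optional[Tuple[float, float]]
Track = List[Point]


def motion_edit(source: Track, target: Track) -> Track:
    """Per-frame displacement (dx, dy) from source to target.

    Frames where the point is not visible in either track yield None.
    """
    if len(source) != len(target):
        raise ValueError("tracks must cover the same number of frames")
    delta: Track = []
    for s, t in zip(source, target):
        if s is None or t is None:
            delta.append(None)
        else:
            delta.append((t[0] - s[0], t[1] - s[1]))
    return delta


def first_edited_frame(delta: Track) -> Optional[int]:
    """Index of the first frame with a non-zero displacement, if any."""
    for i, d in enumerate(delta):
        if d is not None and d != (0.0, 0.0):
            return i
    return None


# Hypothetical track: the point drifts right in the source video; the
# target keeps the same x positions but moves the point down from frame 1.
source = [(10.0, 20.0), (12.0, 20.0), (14.0, 20.0), None]
target = [(10.0, 20.0), (12.0, 22.0), (14.0, 26.0), None]

edit = motion_edit(source, target)
start = first_edited_frame(edit)
```

Here the displacement is zero on frame 0 and non-zero from frame 1 onward, which mirrors the claim that an edit can start at any timestamp rather than only at the first frame.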

Paper Structure

This paper contains 27 sections, 1 equation, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Motion Edits Framework: Users provide an input video along with source motion tracks (colored dots connected by lines, extracted from the input) and target motion tracks (user-specified desired motion). Lines indicate point trajectories while dot presence/absence indicates visibility. Our diffusion model generates an output video matching the target motion. Applications: Our method can edit videos in a true sense, where content is preserved but motion is changed.
  • Figure 2: From left to right: Cat Fish, Camera Control, and Duck Zoom. Cat Fish: in the edited video, the cat moves away from the bowl. Camera Control: in the edited video, the first frame is zoomed out, the middle frame is identical, and the last frame is zoomed in. Duck Zoom: the edited video exhibits different content for a given frame (time) than the original; e.g., the duck is not visible in the first frame of the edited video, whereas it is visible in the original.
  • Figure 3: Controlling Content on Any Frame. By conditioning on the full video, we can move and preserve content appearing on any frame. Methods like ATI rely on the first frame, failing to control objects, like the sign, that emerge mid-sequence.
  • Figure 4: Counterfactual data generation process. To generate a real / counterfactual video pair and its corresponding trajectories, we take a full real video, extract a video clip, then create a counterfactual video. The counterfactual has new motion from the video generator, as well as temporal and spatial augmentations. To ensure we have two corresponding sets of tracks, we specifically use the first and last frames, which directly match the original video, to anchor the tracks for the counterfactual.
  • Figure 5: Our motion-conditioned video diffusion architecture. We extend a T2V DiT model with a control branch that processes three additional video conditioning channels: the counterfactual video, counterfactual motion tracks, and target motion tracks. The control branch duplicates the first 18 transformer blocks and integrates with the main branch through zero-initialized MLPs, similar to ControlNet.
  • ...and 7 more figures
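
The Figure 5 caption describes a control branch that duplicates early transformer blocks and feeds the main branch through zero-initialized MLPs, similar to ControlNet. The toy sketch below illustrates why zero initialization matters: at the start of fine-tuning, the control branch contributes nothing, so the combined model reproduces the pre-trained model exactly. Everything here is an assumption for illustration: the blocks are small residual linear maps in plain Python rather than DiT blocks, the sizes `DIM`, `NUM_BLOCKS`, and `NUM_CONTROL` are toy values (the caption states 18 duplicated blocks), a single vector `cond` stands in for the three conditioning streams, and the point where residuals are added is a guess rather than the authors' design.

```python
import copy
import random

DIM = 4          # toy hidden size (assumption)
NUM_BLOCKS = 6   # toy depth standing in for the full DiT
NUM_CONTROL = 3  # toy stand-in for the 18 duplicated blocks


def matvec(w, x):
    """Multiply a DIM x DIM matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def make_block(rng):
    """Weights of a toy residual block; stands in for a transformer block."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(DIM)]


def apply_block(w, x):
    """Toy residual block: x + W x."""
    return [xi + yi for xi, yi in zip(x, matvec(w, x))]


def zero_linear():
    """Zero-initialized projection from control features to a residual."""
    return [[0.0] * DIM for _ in range(DIM)]


def forward(main_blocks, control_blocks, zero_projs, x, cond):
    """Run the main branch, adding projected control features after each
    of the first len(control_blocks) blocks."""
    c = [xi + ci for xi, ci in zip(x, cond)]  # control input sees the conditioning
    h = x
    for i, w in enumerate(main_blocks):
        h = apply_block(w, h)
        if i < len(control_blocks):
            c = apply_block(control_blocks[i], c)
            residual = matvec(zero_projs[i], c)
            h = [hi + ri for hi, ri in zip(h, residual)]
    return h


rng = random.Random(0)
main = [make_block(rng) for _ in range(NUM_BLOCKS)]
control = copy.deepcopy(main[:NUM_CONTROL])  # duplicate the first blocks
projs = [zero_linear() for _ in control]

x = [1.0, -2.0, 0.5, 3.0]
cond = [0.3, 0.1, -0.2, 0.7]

base_out = forward(main, [], [], x, cond)             # pre-trained model alone
init_out = forward(main, control, projs, x, cond)     # with untrained control branch

projs[0][0][0] = 1.0                                  # mimic one training update
tuned_out = forward(main, control, projs, x, cond)
```

With all projections at zero, `init_out` equals `base_out`, so fine-tuning starts from the pre-trained model's behavior; once a projection weight moves away from zero, the conditioning begins to influence the output.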