Versatile Editing of Video Content, Actions, and Dynamics without Training

Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli

Abstract

Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting content that should affect the behavior of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals and is thus model-agnostic. We show that naively adapting this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources of these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.

Paper Structure

This paper contains 48 sections, 7 equations, 25 figures, 3 tables, and 2 algorithms.

Figures (25)

  • Figure 1: Training-free versatile editing of actions and dynamics in videos. We present DynaEdit, a training-free flow-based method for video editing, which is the first to enable manipulation of dynamics and contents in videos using textual descriptions. DynaEdit supports the modification of actions and the insertion of objects that interact with the scene (e.g. causing a horse to jump due to a newly inserted obstacle, a cat to run off due to a toy edited to become a burning marshmallow, or a billiard ball to enter the pocket). It also allows global and stylistic modifications, like changing daytime to nighttime, all while avoiding unnecessary changes to the video (see SM for the videos).
  • Figure 2: Limitations of state-of-the-art inversion-free editing methods. Existing inversion-free methods [kulikov2025flowedit, li2025flowdirector0, kim2025flowaligntrajectoryregularizedinversionfreeflowbased] suffer from a tradeoff between edit expressivity and visual quality, illustrated here with FlowEdit [kulikov2025flowedit] using an I2V model. When starting the generation at timestep $n_{\text{max}}=N-1$, the method struggles to modify motion (second row). Using $n_{\text{max}}=N$ allows more significant spatio-temporal modifications and thus better adherence to the prompt, but results in severe jitter artifacts and illogical motions (third row). Attempting to reduce artifacts by averaging over $n_{\text{avg}}=100$ edit directions in each step leads to blur (fourth row).
  • Figure 3: Effects of noise in inversion-free editing. (a) A source video (three frames and a spatio-temporal slice corresponding to the dashed line). (b) Three inversion-free editing results (FlowEdit) differing only in the noise sample at timestep $t_N$. As seen, the initial noise strongly affects coarse spatio-temporal features, e.g. modifying the camera motion and train position across edits, although those features are not required to change to adhere to the prompt. (c) Using independent noise samples across timesteps (top) leads to high-frequency jitter, e.g. the blurry bucket and paint drops. This can be alleviated by using the same noise sample for all timesteps (bottom) but at the cost of worse alignment with the source video, e.g. causing the bucket to levitate.
  • Figure 4: DynaEdit. Our method constructs a noise-free path from the source video to the edited one (top pane). The middle pane shows one step in this process, with our key contributions colored. Our SGA module (bottom left) aggregates several noise-free velocities based on the similarities between the edits they induce and the source video. Our ANC mechanism (bottom right) induces gradually increasing correlations between the noises of consecutive timesteps.
  • Figure 5: DynaEdit Results. Our method supports a wide range of edits, including motion manipulation (swans), interactive object addition (horse, barrier, dinosaur), and global style changes (magma, Manga).
  • ...and 20 more figures
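
The captions of Figures 3(c) and 4 contrast independent noise samples across timesteps (jitter), a single shared sample (worse alignment), and noises whose correlation between consecutive timesteps gradually increases. The sketch below illustrates one generic way to produce such a sequence. It is not taken from the paper: the AR(1)-style recursion, the linear correlation schedule, and the names `correlated_noise_sequence` and `rhos` are all assumptions made for illustration. Each new sample mixes the previous one with fresh Gaussian noise so that every sample keeps unit variance.

```python
import numpy as np


def correlated_noise_sequence(shape, rhos, seed=0):
    """Return a stack of unit-variance Gaussian noises with controlled correlation.

    Hypothetical helper (not the paper's ANC mechanism). The k-th sample is
        eps_k = rho_k * eps_{k-1} + sqrt(1 - rho_k^2) * xi_k,
    where xi_k is fresh standard Gaussian noise. rho_k = 0 gives an
    independent sample, rho_k = 1 reuses the previous sample exactly.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(shape)
    samples = [eps]
    for rho in rhos:
        fresh = rng.standard_normal(shape)
        # The mixing weights satisfy rho^2 + (1 - rho^2) = 1, so variance stays 1.
        eps = rho * eps + np.sqrt(1.0 - rho**2) * fresh
        samples.append(eps)
    return np.stack(samples)


# Assumed schedule: correlation ramps linearly from 0 (independent noise)
# to 1 (same noise reused) over nine consecutive steps.
rhos = np.linspace(0.0, 1.0, 9)
noises = correlated_noise_sequence((4096,), rhos)
```

With this schedule the first two samples are statistically independent and the last two are identical, so the two extremes described in Figure 3(c) appear as the endpoints of a single sequence.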