Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany

TL;DR

Time-to-Move (TTM) is a training-free, plug-and-play method for precise motion control in diffusion-based video generation. It takes crude user-provided animations as motion cues and applies region-dependent dual-clock denoising during sampling, using the input image as conditioning to preserve appearance. The motion injection is inspired by SDEdit, and conditioning on full reference frames additionally gives joint appearance control. On object and camera motion benchmarks, TTM matches or exceeds training-based baselines in realism and motion control, and it is compatible with different backbones. Because it only modifies the sampling process, it requires no retraining and adds no runtime cost, which makes it practical for interactive video editing and animation prototyping.

Abstract

Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
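
The dual-clock idea in the abstract can be pictured with a short sketch. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the linear noise mixing in `add_noise`, the placeholder `toy_denoise_step` (a real I2V backbone would go there), and the parameter names `weak_start` and `strong_start` are all hypothetical. It shows one plausible reading of "denoising starts at different noise levels": the whole video starts from a heavily noised copy of the crude animation, and the masked region is kept pinned to the re-noised animation until its own, lower start level is reached.

```python
import numpy as np


def add_noise(clean, level, rng):
    """Mix clean frames with Gaussian noise: level 0.0 is clean, 1.0 is pure noise.

    Assumption: a simple linear mix; a real sampler uses its own noise schedule.
    """
    return (1.0 - level) * clean + level * rng.standard_normal(clean.shape)


def toy_denoise_step(x, level, next_level):
    """Hypothetical stand-in for one sampler step of an I2V diffusion model.

    A real backbone would predict the clean video from x and the conditioning
    image. Here the 'prediction' is just x clipped to [0, 1] so the loop runs.
    """
    predicted_clean = np.clip(x, 0.0, 1.0)
    return predicted_clean + (next_level / level) * (x - predicted_clean)


def dual_clock_sample(reference, mask, levels, weak_start, strong_start,
                      denoise_step=toy_denoise_step, seed=0):
    """Denoise with two start levels: one inside the mask, one outside.

    reference:    crude animation, shape (frames, height, width), values in [0, 1]
    mask:         same shape, 1.0 where motion is specified, 0.0 elsewhere
    levels:       strictly decreasing noise levels ending at 0.0
    weak_start:   index into levels where the unmasked region starts (more noise)
    strong_start: index into levels where the masked region starts (less noise),
                  with weak_start <= strong_start
    """
    rng = np.random.default_rng(seed)
    x = add_noise(reference, levels[weak_start], rng)
    for i in range(weak_start, len(levels) - 1):
        x = denoise_step(x, levels[i], levels[i + 1])
        if i + 1 <= strong_start:
            # The masked region's clock has not started yet: keep it pinned to
            # the reference, re-noised to the level the rest of the video is at.
            pinned = add_noise(reference, levels[i + 1], rng)
            x = mask * pinned + (1.0 - mask) * x
    return x
```

With `levels = np.linspace(1.0, 0.0, 11)`, calling `dual_clock_sample(reference, mask, levels, weak_start=1, strong_start=5)` would let the background denoise from level 0.9 while the masked region only begins to denoise freely from level 0.5. Setting `weak_start == strong_start` reduces the loop to a single-clock, SDEdit-style start.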

Paper Structure

This paper contains 28 sections, 1 equation, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Qualitative results of Time-to-Move on various tasks.
  • Figure 2: Overview of Time-to-Move. Given an input image and a motion instruction, a mask marks the region under strong control. A motion signal is then generated automatically and, together with the image, conditions an image-to-video (I2V) diffusion model. During sampling, denoising starts at different noise levels—lower inside the mask to enforce the specified motion, and higher outside to allow natural deviations in the background. Joint sampling then yields a realistic video that preserves input details while accurately following the motion control.
  • Figure 3: Region-dependent denoising strategies. SDEdit (single clock): low noise levels overconstrain the video, suppressing non-masked region dynamics; high noise levels improve realism but drift from the prescribed motion. RePaint (foreground override): motion is enforced in the object, but uncontrolled regions exhibit artifacts such as duplication. Dual-clock (ours): masked regions follow the intended motion with strong fidelity, while the background denoises more freely, yielding realistic dynamics without artifacts.
  • Figure 4: Qualitative comparison on MC-Bench. Competing methods exhibit artifacts (red), whereas TTM achieves clean placement and appearance consistency.
  • Figure 5: Comparison on a challenging cut-and-drag example. GWTF exhibits strong artifacts under large motion (right); TTM follows the prescribed motion realistically across various models.
  • ...and 12 more figures
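
The cut-and-drag input mentioned in the abstract and in the Figure 2 and Figure 5 captions can be illustrated with a small sketch. This is an assumption-laden example rather than the paper's pipeline: the function name `cut_and_drag`, the integer per-frame offsets, and the constant `fill_value` used for the hole left behind are all hypothetical choices. It shows how a single image plus an object mask could be turned into a crude animation and a per-frame mask marking the region under strong motion control.

```python
import numpy as np


def cut_and_drag(image, object_mask, offsets, fill_value=0.0):
    """Build a crude animation by translating a masked object across a still image.

    image:       (height, width) or (height, width, channels) array
    object_mask: (height, width) boolean array marking the object in the image
    offsets:     per-frame (dy, dx) integer displacements of the object
    Returns (frames, masks): the crude video and the per-frame object masks.
    """
    height, width = object_mask.shape
    ys, xs = np.nonzero(object_mask)
    background = image.copy()
    # Assumption: the hole where the object was cut out gets a constant fill.
    background[object_mask] = fill_value
    frames, masks = [], []
    for dy, dx in offsets:
        new_ys, new_xs = ys + dy, xs + dx
        # Drop object pixels that would land outside the frame.
        keep = (new_ys >= 0) & (new_ys < height) & (new_xs >= 0) & (new_xs < width)
        frame = background.copy()
        frame[new_ys[keep], new_xs[keep]] = image[ys[keep], xs[keep]]
        moved = np.zeros((height, width), dtype=bool)
        moved[new_ys[keep], new_xs[keep]] = True
        frames.append(frame)
        masks.append(moved)
    return np.stack(frames), np.stack(masks)
```

The returned `frames` play the role of the coarse motion cue, and `masks` marks where the prescribed motion should be enforced; both are rough by design, since the diffusion model is expected to restore realism during sampling.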