Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss

Xinyu Zhang, Zicheng Duan, Dong Gong, Lingqiao Liu

TL;DR

This work tackles temporally inconsistent motion-guided video generation in a training-free setting. It couples an inversion-noise initialization derived from a reference video with a novel motion-consistency objective that operates on inter-frame feature correlations to steer diffusion-based generation toward the reference motion while preserving frame fidelity. The core contributions are the motion pattern extraction from sparse points, the loss L_c and its gradient-guided integration into denoising, and the demonstrated improvements on multiple benchmarks for both trajectory-based and reference-video-based control. The approach is efficient, model-agnostic, and compatible with a range of video diffusion models, enabling robust, temporally coherent motion-guided video generation without training or fine-tuning.

Abstract

In this paper, we address the challenge of generating temporally consistent videos with motion guidance. While many existing methods depend on additional control modules or inference-time fine-tuning, recent studies suggest that effective motion guidance is achievable without altering the model architecture or requiring extra training. Such approaches offer promising compatibility with various video generation foundation models. However, existing training-free methods often struggle to maintain consistent temporal coherence across frames or to follow guided motion accurately. In this work, we propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video. We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video, using the gradient of this loss in the latent space to guide the generation process for precise motion control. This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup. Extensive experiments show that our method sets a new standard for efficient, temporally coherent video generation.
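The mechanism described above (correlation patterns between frames, a loss that compares them, and a gradient step in latent space) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the features are the latents themselves (a stand-in for intermediate features of a video diffusion model), that correlation is cosine similarity, that the loss is a mean squared difference, and that the gradient is estimated by finite differences. The names `motion_pattern`, `consistency_loss`, and `guidance_gradient`, the step size, and all shapes are hypothetical.

```python
import numpy as np

def motion_pattern(feats, points):
    """Correlation maps for one tracked point.

    feats:  (F, H, W, C) per-frame features (assumed layout).
    points: (F, 2) integer (row, col) of the tracked point in each frame.
    Returns (F-1, H, W): cosine similarity between the point's feature in
    frame f and every spatial location of frame f+1.
    """
    unit = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    maps = []
    for f in range(feats.shape[0] - 1):
        row, col = points[f]
        query = unit[f, row, col]          # (C,)
        maps.append(unit[f + 1] @ query)   # (H, W)
    return np.stack(maps)

def consistency_loss(ref_pattern, gen_pattern):
    """Assumed loss form: mean squared difference of the two patterns."""
    return float(np.mean((ref_pattern - gen_pattern) ** 2))

def guidance_gradient(latents, points, ref_pattern, eps=1e-4):
    """Central finite-difference gradient of the loss w.r.t. the latents.

    A real pipeline would use autodiff through the diffusion model; finite
    differences are used here only to keep the toy example dependency-free.
    """
    def loss_at(z):
        return consistency_loss(ref_pattern, motion_pattern(z, points))

    z = latents.copy()
    grad = np.zeros_like(latents)
    for idx in np.ndindex(*latents.shape):
        original = z[idx]
        z[idx] = original + eps
        upper = loss_at(z)
        z[idx] = original - eps
        lower = loss_at(z)
        z[idx] = original
        grad[idx] = (upper - lower) / (2 * eps)
    return grad

# Toy data: 3 frames, 4x4 spatial grid, 8 channels (all hypothetical).
rng = np.random.default_rng(0)
points = np.array([[1, 1], [1, 2], [1, 3]])
ref_feats = 1.0 + 0.5 * rng.normal(size=(3, 4, 4, 8))
latents = 1.0 + 0.5 * rng.normal(size=(3, 4, 4, 8))

ref_pattern = motion_pattern(ref_feats, points)
before = consistency_loss(ref_pattern, motion_pattern(latents, points))
# One guidance step: move the latents against the loss gradient.
guided = latents - 0.1 * guidance_gradient(latents, points, ref_pattern)
after = consistency_loss(ref_pattern, motion_pattern(guided, points))
print(before, after)
```

In an actual sampler, a step like this would be applied to the noisy latent at each guided denoising iteration, with the features taken from the diffusion model rather than from the latent directly.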

Paper Structure (17 sections, 4 equations, 14 figures, 2 tables, 1 algorithm)

This paper contains 17 sections, 4 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Visual comparison of our method with two existing motion customization methods: the reference-video-based MotionDirector zhao2025motiondirector and the bounding-box-trajectory-based FreeTraj qiu2024freetraj. Methods in the upper part use the inversion noise from the reference video, while methods in the lower part use specially designed noise as initialization. Red circles mark temporally inconsistent regions, while green circles mark temporally consistent ones.
  • Figure 2: Overview of our method. We first (a) perform inversion noise initialization on the reference video to obtain the initial noise $\mathbf{z}_T$ (Section \ref{sec:initial_noise}). We then (b) extract the motion pattern $\pmb{\mathcal{M}}$ from the reference video for each tracked point $p$ (Section \ref{sec:motion_traj_extract}). During (c) the denoising process, we use the proposed frame-to-frame motion consistency loss $\mathcal{L}_c$, computed with Eq. \ref{eq:consistency_loss} from $\pmb{\mathcal{M}}$ and the pattern $\pmb{\mathcal{M}}'$ newly extracted from the noise $\mathbf{z}_t$, as motion guidance for the noise estimation (Section \ref{sec:consistency_loss}). Details of our method are given in Algorithm \ref{algo}.
  • Figure 3: Qualitative comparison of trajectory control. We evaluate our method against other trajectory-based approaches, i.e., Peekaboo jain2024peekaboo and FreeTraj qiu2024freetraj. "Direct" denotes direct inference with random noise and no additional guidance. We use the same initial noise as in qiu2024freetraj for easier visual comparison. Our method follows the trajectory more accurately and yields better temporal consistency.
  • Figure 4: Qualitative comparison of reference video control. We evaluate our method against MotionDirector zhao2025motiondirector. The red circle marks the point clicked by the user. The red and green rectangles highlight areas where temporal coherence can be compared clearly. We use the same initial noise for qiu2024freetraj and our method for a fair comparison.
  • Figure 5: Ablation study of each component of our method, including the inversion noise initialization and the frame-to-frame consistency guidance.
  • ...and 9 more figures