Video Diffusion Models are Training-free Motion Interpreter and Controller

Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan

TL;DR

Video diffusion models encode cross-frame motion, but motion control typically relies on training-based modules that are resource-intensive and model-specific. The authors uncover a robust, interpretable motion feature, named MOFT, by removing content correlation from diffusion features and keeping the motion channels identified through PCA; it can be extracted without training and generalizes across architectures. They then build a training-free motion-control framework that optimizes the latents during denoising under MOFT guidance, optionally taking the reference MOFT from inversion or statistics. Experiments demonstrate competitive motion fidelity and naturalness across models and introduce point-drag manipulation, highlighting the practical value for flexible, resource-efficient video editing.

Abstract

Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demand substantial training resources and necessitate retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, and therefore lack interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that a robust motion-aware feature already exists in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.
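The two steps named in the abstract, content correlation removal and motion channel filtering, can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not stated in the text above: content removal is modeled as subtracting the per-location mean over frames, and motion channels are taken as the channels with the largest absolute weight in the first principal component. The function names, the toy feature layout `(frames, height, width, channels)`, and the synthetic data are all hypothetical; real features would come from an intermediate layer of a video diffusion model.

```python
import numpy as np


def remove_content(features):
    """Assumed content-removal step: subtract the temporal mean per location.

    features: array of shape (frames, height, width, channels).
    """
    return features - features.mean(axis=0, keepdims=True)


def motion_channels(samples, top_k):
    """Pick the top_k channels by |weight| in the first principal component.

    samples: array of shape (num_samples, channels).
    """
    centered = samples - samples.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    weights = np.abs(vt[0])
    return np.sort(np.argsort(weights)[::-1][:top_k])


def extract_moft(features, channel_idx):
    """Content removal followed by the motion channel filter."""
    return remove_content(features)[..., channel_idx]


def toy_video(direction, rng, frames=8, size=4, channels=6):
    """Synthetic features: static content plus motion on channels 1 and 4."""
    t = np.linspace(-1.0, 1.0, frames)[:, None, None]
    content = 5.0 * rng.normal(size=(1, size, size, channels))
    motion = np.zeros((frames, size, size, channels))
    motion[..., 1] = direction * t
    motion[..., 4] = -2.0 * direction * t
    noise = 0.01 * rng.normal(size=motion.shape)
    return content + motion + noise


rng = np.random.default_rng(0)
videos = [toy_video(d, rng) for d in (-1.0, 1.0, -1.0, 1.0)]
# Collect last-frame, content-removed features from all videos for PCA.
samples = np.concatenate(
    [remove_content(v)[-1].reshape(-1, v.shape[-1]) for v in videos]
)
idx = motion_channels(samples, top_k=2)
moft = extract_moft(videos[0], idx)
print(idx, moft.shape)
```

In this toy setup the static content dominates the raw features, yet after the temporal mean is subtracted only the two channels that carry the synthetic motion signal remain prominent, so the filter selects them. The number of retained channels (`top_k`) is a free parameter of the sketch, not a value taken from the paper.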

Paper Structure

This paper contains 22 sections, 9 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 2: Visualization of PCA on video diffusion features. The left side indicates the frame-wise panning direction, with each color representing a specific direction pattern. We apply PCA to diffusion features extracted from videos with different motion directions and plot their projections on the leading two principal components. (a) The result does not exhibit a distinguishable correlation with motion direction. (b) Features are clearly separated by their motion direction.
  • Figure 3: Cross-frame Channel Value. (a) We plot the histogram of the weight of $\mathcal{P}_1$. It reveals that only a few channels significantly contribute to determining the principal components. (b-c) The motion channels exhibit a pronounced correlation with motion direction trends. (d) In contrast, the non-motion channels show little correspondence with motion direction.
  • Figure 4: Similarity heatmap between the feature at the source point and the target features. Given the red source point in (a), we plot the similarity heatmap on target videos. Yellow indicates regions with higher similarity. We normalize all similarities to the range 0-1 for better illustration. (b-d) Similarity heatmaps of features with different designs. "CR" indicates "content removal". "MCF" indicates "motion channel filter". (e-h) Similarity heatmaps of MOFT in different layers of the U-Net. (2x) denotes a relative spatial resolution scale of 2. (i-l) Similarity heatmaps of MOFT in different video generation models.
  • Figure 5: Motion Control Pipeline. We use a reference MOFT as guidance and optimize the latents to alter the sampling process. In one denoising step, we take the intermediate features and extract MOFT from them with content correlation removal and the motion channel filter. The latents are then optimized with a loss between the masked MOFT and the reference MOFT.
  • Figure 6: Effects of DIFT and MOFT at different denoising time steps. Given the source point in (a) (for DIFT) and (e) (for MOFT), we plot the similarity heatmaps of DIFT (b-d) and MOFT (f-h) at different denoising steps. Yellow indicates higher similarity. The red point in (b-d) marks the position with the highest similarity. The comparison suggests that MOFT can provide more valid information than DIFT at the early denoising stages.
  • ...and 11 more figures
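
The similarity heatmaps of Figures 4 and 6 and the masked guidance loss of Figure 5 can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: cosine similarity is assumed as the similarity measure, min-max scaling is assumed for the 0-1 normalization mentioned in the Figure 4 caption, and a masked mean squared error stands in for the loss between the masked MOFT and the reference MOFT. All function names and array shapes are hypothetical.

```python
import numpy as np


def similarity_heatmap(source_vec, target_feats, eps=1e-8):
    """Cosine similarity of one source vector against an (H, W, C) feature
    map, min-max scaled to [0, 1] (assumed normalization)."""
    src = source_vec / (np.linalg.norm(source_vec) + eps)
    norms = np.linalg.norm(target_feats, axis=-1, keepdims=True)
    sim = (target_feats / (norms + eps)) @ src  # shape (H, W)
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + eps)


def masked_moft_loss(moft, reference, mask):
    """Mean squared error restricted to the masked region (assumed loss form).

    moft, reference: (frames, H, W, C); mask: (H, W), 1 inside the region.
    """
    m = mask[None, :, :, None].astype(moft.dtype)
    squared = (moft - reference) ** 2 * m
    count = m.sum() * moft.shape[0] * moft.shape[-1]
    return squared.sum() / (count + 1e-8)


rng = np.random.default_rng(0)

# Heatmap: plant a scaled copy of the source vector at location (2, 3).
source = rng.normal(size=3)
target = rng.normal(size=(4, 5, 3))
target[2, 3] = 3.0 * source
heat = similarity_heatmap(source, target)
peak = np.unravel_index(heat.argmax(), heat.shape)

# Loss: identical features give zero; a uniform offset of 1 gives about 1.
moft = rng.normal(size=(2, 4, 5, 3))
mask = np.zeros((4, 5))
mask[1:3, 1:4] = 1
loss_same = masked_moft_loss(moft, moft, mask)
loss_shift = masked_moft_loss(moft + 1.0, moft, mask)
print(peak, loss_same, loss_shift)
```

In an actual control loop this loss would be differentiated with respect to the latents at a denoising step, which requires an autodiff framework and the diffusion model itself; the sketch only covers the quantities that the figure captions describe.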