FastVMT: Eliminating Redundancy in Video Motion Transfer

Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang

TL;DR

FastVMT targets inefficiencies in training-free video motion transfer methods built on diffusion-transformer (DiT) backbones. It identifies two sources of waste: motion redundancy, which comes from computing attention over large, global regions, and gradient redundancy across diffusion steps. It addresses the first with sliding-window motion extraction and a corresponding-window loss, and the second with a step-skipping gradient optimization that reuses gradients. The method achieves a 3.43× average speedup and up to 14.91× lower latency while preserving visual fidelity and temporal consistency across complex motions and camera dynamics. The speedup moves open-domain motion transfer closer to real-time use, with high-quality results in both single- and multi-object scenarios.

Abstract

Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood, so that interaction weights are not computed unnecessarily for distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43× speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
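The local-neighborhood masking described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `local_window_mask` and `masked_softmax`, the square (Chebyshev-distance) window, and the `radius` parameter are all assumptions made for illustration. The idea shown is only that a query token at one grid position attends to key tokens in the next frame that lie within a small spatial window, so scores for distant positions never enter the softmax.

```python
import math

def local_window_mask(height, width, radius):
    """Return an (H*W) x (H*W) boolean mask. Entry [q][k] is True when key
    token k lies within `radius` grid cells of query token q (square window).
    The window shape and size are illustrative assumptions."""
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        qy, qx = divmod(q, width)
        for k in range(n):
            ky, kx = divmod(k, width)
            mask[q][k] = abs(qy - ky) <= radius and abs(qx - kx) <= radius
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, restricted to allowed keys.
    Disallowed positions receive exactly zero weight."""
    peak = max(s for s, a in zip(scores, allowed) if a)  # numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: a 4x4 token grid with a window radius of 1.
mask = local_window_mask(4, 4, 1)
corner_keys = sum(mask[0])   # corner token sees a 2x2 neighborhood
center_keys = sum(mask[5])   # interior token sees a 3x3 neighborhood
weights = masked_softmax([0.0] * 16, mask[0])
```

With a fixed radius, the number of keys per query stays constant as the grid grows, whereas full attention scales with the total number of tokens per frame. A practical version would skip the masked entries entirely rather than build a dense mask as this toy does.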

Paper Structure (13 sections, 10 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Motivation of our method. Training-free video motion transfer can benefit from redundancies, both at the level of the DiT architecture and of the iterative diffusion process. (a) Motion redundancy: Video motion is small and locally consistent, so a motion token in one frame will only ever match tokens in the next frame within a local neighborhood. (b) Gradient redundancy: Gradient updates in consecutive optimization steps are mostly similar (visualized here with PCA). There is no need to recompute them at every single step.
  • Figure 2: Illustration of step-skipping gradient optimization. We observe that skipping some steps of the gradient optimization does not degrade motion transfer performance.
  • Figure 3: Overview of our method. Left: Given a reference video, we first use a sliding window to extract motion embeddings from attention during the inversion stage. At the denoising stage, we compute the total loss and use step-skipping gradient optimization to guide video generation. Right: Step-skipping gradient optimization is proposed to reduce gradient redundancy. Additionally, we introduce the corresponding-window loss to improve the motion consistency of generated videos.
  • Figure 4: Illustration of attention motion flow extraction with sliding window. Without the sliding window, attention tokens are prone to incorrect correspondences (middle). Incorporating a sliding window improves alignment, leading to better motion consistency (right).
  • Figure 5: Gallery of our method. Given a reference video, FastVMT generates high-quality video clips that faithfully preserve diverse motion patterns.
  • ...and 4 more figures
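The gradient reuse described in the captions of Figures 1(b) and 2 can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's procedure: the function `guided_updates`, the fixed `recompute_every` interval, the learning rate, and the quadratic loss are all hypothetical, and the actual rule for deciding which steps to skip is not given in this summary. The sketch only shows the mechanism of caching a gradient and applying it again on skipped steps.

```python
def guided_updates(latent, grad_fn, num_steps, recompute_every, lr):
    """Apply `num_steps` gradient updates to `latent`. The gradient is
    recomputed only every `recompute_every` steps; in between, the cached
    gradient from the most recent computation is reused."""
    cached = None
    grad_calls = 0
    for step in range(num_steps):
        if cached is None or step % recompute_every == 0:
            cached = grad_fn(latent)   # expensive call in a real pipeline
            grad_calls += 1
        latent = [z - lr * g for z, g in zip(latent, cached)]
    return latent, grad_calls

# Toy loss 0.5 * ||z - target||^2, whose gradient is (z - target).
target = [1.0, -2.0]

def toy_grad(z):
    return [zi - ti for zi, ti in zip(z, target)]

def distance(z):
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target)) ** 0.5

start = [0.0, 0.0]
dense, dense_calls = guided_updates(start, toy_grad, 10, 1, 0.1)
skipped, skipped_calls = guided_updates(start, toy_grad, 10, 2, 0.1)
```

On this toy loss, the skipped schedule makes half as many gradient calls and still moves the latent toward the target. Whether this carries over to a real diffusion pipeline depends on how slowly the true gradients change, which is the empirical observation the figures report.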