Spectral Motion Alignment for Video Motion Transfer using Diffusion Models

Geon Yeong Park, Hyeonho Jeong, Sang Wan Lee, Jong Chul Ye

TL;DR

Motion transfer in diffusion-based video generation struggles to capture global motion and is vulnerable to frame-wise artifacts when relying on pixel-space residuals. The authors propose Spectral Motion Alignment (SMA), a frequency-domain framework that refines and aligns motion vectors using 1D wavelet-domain global alignment and 2D FFT-based local refinement, with amplitude/phase spectrum losses prioritizing low frequencies. SMA is designed to be compatible with a range of diffusion-based motion distillation methods and can extend to diffusion-feature space. Empirical results across both text-to-video and text-to-image diffusion setups show consistent improvements in motion accuracy and temporal coherence with minimal overhead, validating SMA’s utility for flexible video customization.

Abstract

The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models (VDMs) have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
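
As a minimal sketch of the idea in the abstract, the NumPy snippet below treats consecutive-frame residuals as motion vectors and compares a target and an estimate through a multi-level 1D wavelet transform along the time axis. The Haar wavelet, the L1 distance, the number of levels, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def frame_residuals(video):
    """Motion vectors as consecutive-frame residuals: video[n+1] - video[n].

    video has shape (frames, height, width); the result has frames - 1 entries.
    """
    return video[1:] - video[:-1]

def haar_dwt_time(x):
    """Single-level 1D Haar transform along axis 0 (time).

    Returns (approximation, detail). If the length is odd, the last
    time step is dropped (a simplification made for this sketch).
    """
    n = (x.shape[0] // 2) * 2
    even, odd = x[0:n:2], x[1:n:2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def global_alignment_loss(target, estimate, levels=2):
    """Mean absolute difference between temporal Haar coefficients.

    Hypothetical loss: detail coefficients are compared at every level,
    and the coarsest approximation is compared once at the end.
    """
    loss = 0.0
    a_t, a_e = target, estimate
    for _ in range(levels):
        if a_t.shape[0] < 2:
            break
        a_t, d_t = haar_dwt_time(a_t)
        a_e, d_e = haar_dwt_time(a_e)
        loss += np.mean(np.abs(d_t - d_e))
    loss += np.mean(np.abs(a_t - a_e))
    return float(loss)
```

Because the transform runs over the time axis of the whole clip, each coefficient summarizes several frames at once, which is one way to expose the global motion context that a single frame residual lacks.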

Paper Structure

This paper contains 33 sections, 21 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: One-shot Video Motion Transfer via Spectral Motion Alignment using Cascaded Video Diffusion Models. SMA facilitates the capture of long-range (left) and complex (right) motion patterns within videos. Visit https://geonyeong-park.github.io/spectral-motion-alignment/ for a comprehensive view of the videos.
  • Figure 2: Overview. The proposed Spectral Motion Alignment (SMA) framework distills motion information in the frequency domain. Treating the (latent) frame residuals as motion vectors, we first derive the denoised motion vector estimates. Then, the motion vector $\delta {\boldsymbol v}_0^n$ and its estimate $\delta \hat{{\boldsymbol v}}_0^n$ are aligned in both the pixel domain and the frequency domain. Our regularization includes (1) global motion alignment based on a 1D wavelet transform, and (2) local motion refinement based on a 2D Fourier transform.
  • Figure 3: Comparison within MotionDirector framework.
  • Figure 4: Comparison within the VMC framework using the Show-1 video model (top) and the DMT framework using the Zeroscope video model (bottom). These demonstrate the compatibility of SMA in pixel space and feature space, respectively.
  • Figure 5: Comparison within the Tune-A-Video (top) and ControlVideo-Depth (bottom) baselines.
  • ...and 8 more figures
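
The local refinement step named in the Figure 2 caption, together with the amplitude and phase spectrum losses mentioned in the TL;DR, might look roughly like the NumPy sketch below. The exponential low-frequency weighting, the `scale` parameter, the L1 form of both losses, and the function names are all assumptions made for illustration; the summary above does not specify them.

```python
import numpy as np

def low_frequency_weights(height, width, scale=0.25):
    """Weights that decay with spatial frequency (hypothetical choice).

    The zero-frequency (DC) bin gets weight 1; higher frequencies get less.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return np.exp(-radius / scale)

def spectral_losses(target, estimate, scale=0.25):
    """Weighted amplitude and phase discrepancies of per-frame 2D spectra.

    target and estimate have shape (frames, height, width).
    Returns (amplitude_loss, phase_loss) as Python floats.
    """
    f_t = np.fft.fft2(target, axes=(-2, -1))
    f_e = np.fft.fft2(estimate, axes=(-2, -1))
    w = low_frequency_weights(target.shape[-2], target.shape[-1], scale)

    # Amplitude term: compare spectrum magnitudes bin by bin.
    amplitude = np.mean(w * np.abs(np.abs(f_t) - np.abs(f_e)))

    # Phase term: the angle of f_t * conj(f_e) is the phase difference
    # wrapped into (-pi, pi], which avoids 2*pi discontinuities.
    phase = np.mean(w * np.abs(np.angle(f_t * np.conj(f_e))))
    return float(amplitude), float(phase)
```

In this sketch the two terms respond to different errors: rescaling the estimate changes only the amplitude term, while a circular spatial shift of the estimate changes only the phase term.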