
SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

Xi Ye, Wenjia Yang, Yangyang Xu, Xiaoyang Liu, Duo Su, Mengfei Xia, Jun Zhu

Abstract

Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, such as reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in the post-training of video diffusion models. To address it, we introduce pixel-motion rewards based on pixel flux dynamics that capture both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves the dynamic-degree collapse that arises when modern video diffusion models undergo supervised fine-tuning.

Paper Structure (30 sections, 36 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Overview of the proposed approach. (a) Pixel-motion reward models: instantaneous reward based on optical-flow residual (top) and long-term reward based on trajectory dynamics (bottom). (b) The SHIFT fine-tuning framework.
  • Figure 2: Qualitative comparison of VBench-I2V Standard test examples generated by the base SVD model and different fine-tuned SVD variants.
  • Figure 3: Qualitative comparison on the WISA-80K validation set for the base Wan2.2-TI2V model and fine-tuned variants. SHIFT generates more realistic hydraulic press motion.
  • Figure 4: Ablation of SHIFT components over training epochs. All models start from the same pre-trained SVD checkpoint (epoch 0). Adding Noise Alignment (NA) strengthens the SFT anchor regularization, while the Adversarial RM further stabilizes training to avoid reward hacking.
  • Figure 5: Effect of temperature $\beta$ on training dynamics. (a) Smaller $\beta$ yields faster reward optimization. (b) The motion--appearance gap reveals reward hacking: $\beta{=}1$ peaks early then collapses as both metrics degrade; $\beta{=}10$ turns negative; $\beta{=}100$ preserves the pre-trained balance. (c) FVD monotonically increases with smaller $\beta$, confirming greater distributional drift under stronger RL signals.
  • ...and 8 more figures
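
The Figure 5 caption ties the temperature $\beta$ to the strength of the reward signal. The exact SHIFT weighting is not given in this summary, so the sketch below assumes a generic advantage-weighted form in which each sample's loss weight is proportional to $\exp(A_i/\beta)$. The function names and advantage values are hypothetical and only illustrate why a large $\beta$ behaves close to uniformly weighted supervised fine-tuning while a small $\beta$ concentrates weight on high-advantage samples.

```python
import math


def advantage_weights(advantages, beta):
    """Generic advantage-weighted weights (an assumption, not the paper's formula).

    Each weight is proportional to exp(A_i / beta) and the weights are
    rescaled so that their mean is 1.
    """
    scaled = [a / beta for a in advantages]
    peak = max(scaled)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    n = len(exps)
    return [n * e / total for e in exps]


def weight_spread(weights):
    """Gap between the largest and smallest weight."""
    return max(weights) - min(weights)


# Hypothetical per-sample advantages, for illustration only.
advantages = [-1.0, 0.0, 1.0, 2.0]

for beta in (1.0, 10.0, 100.0):
    weights = advantage_weights(advantages, beta)
    print(f"beta={beta:>5}: weights={[round(w, 3) for w in weights]}")
```

Under this assumed form, the weights at $\beta{=}100$ stay within a few percent of 1, so the update is nearly the same as an unweighted supervised step, whereas at $\beta{=}1$ the highest-advantage sample dominates the batch.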