Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang

Abstract

Recent advances in video generation models have achieved impressive results. However, these models heavily rely on high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discover that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We find that quality-imbalanced data can produce gradients similar to those of golden data at appropriate timesteps. Based on this, we introduce the novel concept of timestep selection in the training process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process: the timestep distribution is skewed toward higher timesteps for motion-rich data, while data with high visual quality is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated, quality-imbalanced data to surpass conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The motion-vision dilemma in video data. (a) Illustrates two cases: a high-MQ, low-VQ action scene (left) and a low-MQ, high-VQ static scene (right), with their corresponding optical flow maps below visualizing the difference in motion intensity. (b) The 2D density plot over Koala36M dataset confirms this trade-off. MQ and VQ exhibit a negative correlation, with the majority of samples ($56.3\%$ total) falling into the high-VQ/low-MQ or low-VQ/high-MQ quadrants, as defined by the median lines. Data with both high qualities is less common ($21.9\%$).
  • Figure 2: (a) Video diffusion models exhibit hierarchical denoising: high-noise stages ($t=0.9, 0.6$) establish motion and composition, while low-noise stages ($t=0.3, 0.0$) refine details and textures. (b) Gradient analysis under quality degradations, averaged over 120 samples with three VQ degradation types (blur, compression, noise) and MQ degradation (shuffle). VQ degradation aligns with the original at high timesteps, while MQ degradation aligns at low timesteps, revealing that quality-imbalanced samples can match golden data's gradients at appropriate timesteps and should be strategically allocated across the denoising process.
  • Figure 3: Overview of our timestep selective training process. Given bimodal quality distributions in the training data (left), TQD employs quality-based sample dropout (middle) and adaptive timestep sampling via Beta distributions (right). Samples with high motion intensity are directed toward large timesteps and those with high visual quality toward small timesteps, achieving specialized learning across the denoising process.
  • Figure 4: Qualitative comparison. TQD demonstrates superior performance across multiple quality dimensions. Top rows: improved visual quality with reduced hand distortion (left) and better motion coherence matching prompt dynamics (right). Bottom rows: enhanced physical plausibility in foam dissipation (left) and liquid dynamics (right), as highlighted in red boxes.
  • Figure 5: Timestep sampling distributions under varying quality scores. We visualize the Beta distributions with $\kappa_{\text{base}} = 4$ (top) and $\kappa_{\text{base}} = 2$ (bottom) for samples with different MQ and VQ scores. Left: High MQ, low VQ shifts toward large timesteps. Middle: Equal MQ and VQ yields centered distributions. Right: Low MQ, high VQ concentrates on small timesteps. Note that $\kappa_{\text{base}} = 2$ degenerates to uniform when MQ = VQ (middle-bottom).
  • ...and 2 more figures
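
The adaptive timestep sampling described in Figures 3 and 5 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the exact Beta parameterization are assumptions, chosen so that the behavior matches the captions, i.e. high MQ skews the distribution toward large timesteps, high VQ toward small timesteps, and $\kappa_{\text{base}} = 2$ degenerates to a uniform distribution when MQ = VQ.

```python
import random

def sample_timestep(mq: float, vq: float, kappa_base: float = 4.0) -> float:
    """Hypothetical per-sample timestep sampler consistent with Figure 5.

    mq, vq: motion-quality and visual-quality scores (positive).
    Returns t in [0, 1]; high MQ skews t toward 1 (large timesteps,
    motion/composition), high VQ toward 0 (small timesteps, details).
    """
    w = mq / (mq + vq)              # relative motion weight in (0, 1)
    alpha = kappa_base * w          # mass toward t = 1
    beta = kappa_base * (1.0 - w)   # mass toward t = 0
    # When mq == vq and kappa_base == 2: alpha == beta == 1, i.e. Beta(1, 1),
    # which is the uniform distribution noted in Figure 5 (middle-bottom).
    return random.betavariate(alpha, beta)
```

Under this parameterization the mean sampled timestep is simply MQ / (MQ + VQ), and $\kappa_{\text{base}}$ controls how sharply the distribution concentrates around it.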