Towards Precise Scaling Laws for Video Diffusion Transformers

Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai

TL;DR

This work demonstrates that video diffusion transformers exhibit meaningful scaling laws and that hyperparameters such as batch size and learning rate critically shape performance. It introduces explicit scaling laws for optimal hyperparameters, $B_{opt}(N,T)$ and $\eta_{opt}(N,T)$, enabling precise model-size and compute-budget planning. Under optimal hyperparameters, the approach achieves comparable validation loss with substantially lower inference costs and provides a generalized $L(T,N)$ framework to predict performance across non-optimal configurations. The findings advance practical scalability for video generation by enabling accurate extrapolations to larger models and budgets and highlight the importance of hyperparameter tuning in scaling analyses.

Abstract

Achieving optimal performance of video diffusion transformers within a given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
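
As a rough illustration of how a hyperparameter scaling law of this kind might be fitted, the sketch below assumes a separable power law, $\eta_{opt}(N,T) = a N^{b} T^{c}$, and fits it by least squares in log space. The functional form, the names `fit_power_law` and `predict`, the token budgets, and the coefficients are all assumptions made for illustration. The data are synthetic, not results from the paper, and the paper's actual parameterization may differ.

```python
import math


def fit_power_law(samples):
    """Least-squares fit of log y = log a + b*log N + c*log T.

    samples: list of (N, T, y) tuples with strictly positive entries.
    Returns (a, b, c) for the ASSUMED form y = a * N**b * T**c.
    """
    # Design-matrix rows [1, log N, log T] and targets log y.
    rows = [(1.0, math.log(n), math.log(t)) for n, t, _ in samples]
    targets = [math.log(y) for _, _, y in samples]

    # Normal equations (X^T X) w = X^T z as an augmented 3x4 matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * z for r, z in zip(rows, targets))]
        for i in range(3)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]

    w = [aug[i][3] / aug[i][i] for i in range(3)]
    return math.exp(w[0]), w[1], w[2]


def predict(a, b, c, n_params, n_tokens):
    """Evaluate the assumed power law at a given model size and token count."""
    return a * n_params**b * n_tokens**c


# Synthetic, noise-free data generated from made-up coefficients.
TRUE_A, TRUE_B, TRUE_C = 2.0, -0.3, 0.1
model_sizes = [0.02e9, 0.06e9, 0.13e9, 0.26e9]  # sizes named in the figure captions
token_counts = [1e9, 2e9, 4e9]                  # hypothetical token budgets
samples = [
    (n, t, TRUE_A * n**TRUE_B * t**TRUE_C)
    for n in model_sizes
    for t in token_counts
]

a, b, c = fit_power_law(samples)
# Extrapolate to a larger model, mirroring the 1.07B / 10B-token setting.
print(predict(a, b, c, 1.07e9, 10e9))
```

Fitting in log space turns the power law into an ordinary linear regression, which is why a small closed-form solve is enough here. With real sweep data, the targets would be the empirically best learning rate (or batch size) at each $(N, T)$ pair.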

Paper Structure

This paper contains 37 sections, 39 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Validation loss across various model sizes with different amounts of training tokens and hyperparameters. Each panel represents a different model size (0.02B, 0.06B, 0.13B, and 0.26B). Gray points denote the validation loss achieved with fixed suboptimal hyperparameters, while red points highlight the lowest validation loss obtained with the optimal batch size and learning rate. This demonstrates that selecting optimal hyperparameters is essential for correctly fitting the loss curve, ensuring more accurate alignment with expected scaling trends as model size and data size increase.
  • Figure 2: Optimal learning rate scaling curve. Left: Optimal learning rate scaling curves fitted on four different model sizes (0.02B, 0.06B, 0.13B, and 0.26B parameters). "Observations" indicates values within 0.02% of the minimum loss for each model size. Right: Extrapolated scaling curves for learning rate, predicting optimal values for a 1.07B model to achieve minimal validation loss.
  • Figure 3: Optimal batch size scaling curve. Left: Optimal batch size scaling curves fitted on four different model sizes (0.02B, 0.06B, 0.13B, and 0.26B parameters). "Observations" indicates values within 0.02% of the minimum loss for each model size. Right: Extrapolated scaling curves for batch size, predicting optimal values for a 1.07B model to achieve minimal validation loss.
  • Figure 4: Predictions of optimal hyperparameters on 1.07B model size with 4B and 10B training tokens. The red pentagrams indicate the predicted optimal batch size and learning rate, along with their predicted validation loss.
  • Figure 5: Loss scaling with optimal hyperparameters across varying model and compute scales. Left: Fitted loss curves under optimal hyperparameters across four smaller models, each trained with varying numbers of tokens. Right: Extrapolated loss curves extended to larger model sizes and compute budgets, offering predictions for any model size and number of training tokens. The red pentagram indicates the projected loss for a 1.07B model with 10B training tokens, while the blue pentagram marks the expected loss for a 0.72B model under a compute budget of $5.85 \times 10^{20}$. Experimental results are shown as green and orange hexagons, validating the extrapolation accuracy.
  • ...and 4 more figures
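
The "Observations" criterion in the captions of Figures 2 and 3 (values within 0.02% of the minimum loss for each model size) can be sketched as a simple filter over a hyperparameter sweep. This is a minimal sketch that reads "within 0.02%" as a relative tolerance on the loss, which is an assumption. The function name `near_optimal`, the run format, and the numbers are hypothetical and not taken from the paper.

```python
def near_optimal(runs, rel_tol=0.0002):
    """Keep runs whose loss is within rel_tol of the best loss.

    runs: list of ((learning_rate, batch_size), validation_loss) pairs
          for ONE model size, with positive losses.
    rel_tol: relative tolerance; 0.0002 corresponds to 0.02% (assumed reading).
    """
    best = min(loss for _, loss in runs)
    threshold = best * (1.0 + rel_tol)
    return [(hp, loss) for hp, loss in runs if loss <= threshold]


# Hypothetical sweep results for a single model size.
runs = [
    ((1e-4, 256), 0.50000),
    ((2e-4, 256), 0.50005),  # within 0.02% of the minimum, so kept
    ((4e-4, 256), 0.50020),  # outside the tolerance, so dropped
]
kept = near_optimal(runs)
print([hp for hp, _ in kept])
```

Keeping every near-minimal configuration, rather than only the single best run, gives the subsequent curve fit several points per model size and makes it less sensitive to noise in any one run.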