Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi, Zhihui Li, Xiaojun Chang

TL;DR

This work tackles the high training cost of diffusion models for high-resolution, multi-frame human video generation. It introduces Entropy-Guided Prioritized Progressive Learning (Ent-Prog), which pairs two components: Conditional Entropy Inflation (CEI), used to rank network blocks and prioritize their training, and an adaptive progressive schedule, guided by a Nested Diffusion Supernet, that maximizes convergence efficiency. Empirical results on three diverse datasets show up to 2.2× training speedup and 2.4× GPU memory reduction without compromising generative quality or pose adherence. The approach makes training controllable human video generation models more scalable and efficient.

Abstract

Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that increases computational complexity during training according to the measured convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, which achieves up to 2.2× training speedup and 2.4× GPU memory reduction without compromising generative performance.
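
As a rough illustration of the prioritization idea, the sketch below ranks blocks by how much a conditional-entropy estimate rises when each block is skipped. The function name `rank_blocks_by_cei`, the baseline value, and the per-block numbers are all hypothetical. The paper's actual CEI estimator is not reproduced here, so the "skipped minus baseline" difference should be read as an assumption.

```python
from typing import Dict, List


def rank_blocks_by_cei(baseline_entropy: float,
                       entropy_when_skipped: Dict[int, float]) -> List[int]:
    """Return block indices sorted from most to least critical.

    Assumption: the inflation of a block is the entropy estimate measured
    with that block skipped minus the estimate of the full model.
    """
    inflation = {
        block: skipped - baseline_entropy
        for block, skipped in entropy_when_skipped.items()
    }
    # Larger inflation means skipping the block hurts conditional adherence more.
    return sorted(inflation, key=lambda block: inflation[block], reverse=True)


# Made-up entropy estimates for a four-block toy model.
toy_estimates = {0: 1.30, 1: 1.05, 2: 1.80, 3: 1.10}
priority = rank_blocks_by_cei(baseline_entropy=1.00,
                              entropy_when_skipped=toy_estimates)
print(priority)  # [2, 0, 3, 1]
```

Under this reading, the blocks at the front of `priority` would be trained first, while those at the end are candidates for freezing or skipping early in training.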

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: The impact of freezing or skipping blocks. (a) illustrates the effect of freezing different numbers of blocks on the model’s final convergence performance, showing a clear decline as more blocks are frozen. (b) and (c) present the loss and Conditional Entropy Inflation (CEI) when randomly skipping 8 to 23 blocks, emphasizing the objective of accelerating convergence by selectively skipping blocks with lower interaction (characterized by lower CEI and loss). (d) compares the training dynamics of the most important 10 blocks and the least important 10 blocks, with all other blocks frozen. It is clear that the more influential blocks contribute to faster convergence and better model performance.
  • Figure 2: Human video generation results by Ent-Prog with up to 2.1× training acceleration and 2.4× lower training VRAM usage. Given a reference image (left image for each clip), we generate consistent and controllable human dance videos after Ent-Prog efficient training.
  • Figure 3: Illustration of how the training priority of blocks in a diffusion model is decided with Conditional Entropy Inflation. After ranking, blocks in deeper colors have higher conditional entropy inflation when skipped, suggesting that they are more critical for conditional adherence.
  • Figure 4: Illustration of the adaptive progressive schedule. We search within the defined space to identify sub-networks with optimal convergence efficiency. At the start of each progressive learning stage, we first train a one-shot supernet and evaluate the convergence efficiency of each unfreezing choice. The sub-network with optimal convergence efficiency is then selected; it inherits its parameters from the supernet and continues training in the next phase.
  • Figure 5: Qualitative comparison of two training methods on the Bilibili dataset. The red boxes indicate the defects of the generated images. Ent-Prog surpasses full training in terms of visual coherence and realism, and it also excels in restoring fine-grained details such as facial expressions. The first row highlights the unreasonable artifacts in the results generated by the full training method, which are absent in Ent-Prog. The second row marks the shortcomings of the full training method in restoring facial features.
  • ...and 4 more figures
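
The selection step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `UnfreezeChoice` record, the probe losses, and the definition of convergence efficiency as loss reduction per unit of compute are all assumptions, since this page does not give the paper's actual metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UnfreezeChoice:
    num_unfrozen_blocks: int   # how many blocks this candidate trains
    loss_before: float         # loss at the start of the probe
    loss_after: float          # loss after a short probe with shared weights
    cost: float                # relative compute cost of the probe


def convergence_efficiency(choice: UnfreezeChoice) -> float:
    """Assumed proxy: loss reduction per unit of compute."""
    return (choice.loss_before - choice.loss_after) / choice.cost


def select_choice(choices: Iterable[UnfreezeChoice]) -> UnfreezeChoice:
    """Pick the candidate with the highest assumed efficiency."""
    return max(choices, key=convergence_efficiency)


# Made-up probe results for three unfreezing choices.
candidates = [
    UnfreezeChoice(num_unfrozen_blocks=8, loss_before=0.50, loss_after=0.44, cost=1.0),
    UnfreezeChoice(num_unfrozen_blocks=16, loss_before=0.50, loss_after=0.40, cost=2.0),
    UnfreezeChoice(num_unfrozen_blocks=24, loss_before=0.50, loss_after=0.38, cost=3.0),
]
best = select_choice(candidates)
print(best.num_unfrozen_blocks)
```

In this toy example the smallest candidate wins because its loss drop per unit of cost is the largest. A progressive schedule would repeat the probe at the start of each stage, so larger candidates can be selected later once the cheap ones stop improving.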