FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

Shao Shitong, Gu Yufei, Xie Zeke

TL;DR

This paper proposes FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts, and reveals that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget.

Abstract

The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.

Paper Structure

This paper contains 23 sections, 9 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Model performance, calculated as the average of VBench's dynamic degree, aesthetic, and image quality metrics, plotted against sampling steps and parameter retention rate. The trade-off between parameters and sampling steps in FastLightGen compression reveals that performance is initially more sensitive to a reduction in sampling steps than to parameter pruning. As the overall compression ratio increases, the performance degradation from either method begins to converge. Notably, the model with only 30% of its parameters at 4 steps performs on par with the 100% parameter model at 1.2 steps. We report that a moderate compression strategy (specifically, retaining 70% of parameters at 4 sampling steps) strikes a highly effective trade-off, achieving a theoretical speedup of $\approx 35.71\times$ over the 50-step unpruned baseline, which relies on classifier-free guidance (CFG) [nips2021_classifier_free_guidance].
  • Figure 2: VBench average score vs. time spent. MgD: MagicDistillation. Among the accelerated sampling algorithms evaluated, our proposed FastLightGen achieves the greatest speedup, while its performance also surpasses that of the teacher model (Euler).
  • Figure 3: Overview of FastLightGen's three-stage distillation pipeline. The first stage identifies less critical layers within the model. The second stage introduces dynamic probabilistic pruning, where these identified layers are stochastically skipped during training. This process yields a robust, stochastically-pruned student model for the final stage, in which we perform distribution matching. Our "well-guided teacher guidance", which is constructed from the stochastically-pruned student, ensures that the resulting lightweight, few-step generator maintains high performance.
  • Figure 4: In the first phase of FastLightGen, we identify non-critical layers by visualizing their importance, where a higher error value corresponds to greater layer criticality. For both HunyuanVideo-TI2V and WanX-TI2V, the results reveal a consistent pattern: the initial and final layers are the most critical, while the importance of the intermediate layers is substantially lower.
  • Figure 5: Visualization of FastLightGen (i.e., a 4-step generator that retains 70% of the parameters) across diverse scenarios, including landscapes, food vlogging, dance, and daily activities. Our model generates high-fidelity videos characterized by realistic character motion, detailed expressions, and strong temporal dynamics.
  • ...and 5 more figures
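
The theoretical speedup quoted in the Figure 1 caption can be reproduced with a short back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes that CFG doubles the number of network evaluations per sampling step, that the distilled generator runs without CFG, and that the cost of one evaluation scales linearly with the fraction of parameters retained. The function names are hypothetical and chosen only for illustration.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Network evaluations per sample.

    Assumption: classifier-free guidance needs a conditional and an
    unconditional evaluation at every step, so it doubles the count.
    """
    return steps * (2 if uses_cfg else 1)


def theoretical_speedup(
    base_steps: int,
    base_uses_cfg: bool,
    fast_steps: int,
    fast_uses_cfg: bool,
    retention: float,
) -> float:
    """Ratio of baseline cost to compressed-model cost.

    Assumption: one evaluation of the pruned model costs `retention`
    times one evaluation of the unpruned model (linear in parameters).
    """
    base_cost = forward_passes(base_steps, base_uses_cfg) * 1.0
    fast_cost = forward_passes(fast_steps, fast_uses_cfg) * retention
    return base_cost / fast_cost


# 50-step unpruned baseline with CFG vs. 4-step generator keeping 70% of parameters.
speedup = theoretical_speedup(50, True, 4, False, 0.70)
print(f"{speedup:.2f}x")  # prints "35.71x"
```

Under these assumptions the baseline costs 100 full-size evaluations and the compressed generator costs the equivalent of 2.8, which gives the reported ratio. Measured wall-clock speedups will generally differ, since this estimate ignores memory traffic, text and VAE encoders, and other fixed overheads.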