DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, Yuchen Liu

TL;DR

This work targets efficient text-to-video (T2V) generation with diffusion models by introducing DOLLAR, a multi-objective distillation framework that combines variational score distillation and consistency distillation with latent reward fine-tuning. The approach enables few-step generation (1 to 4 sampling steps) while preserving visual quality and semantic alignment, demonstrated on 10-second videos (128 frames at 12 FPS). The distilled student reaches a VBench score of 82.57, and one-step distillation speeds up the teacher's diffusion sampling by up to 278.6×; human evaluations confirm improvements in visual quality and text-video alignment. The latent reward model enables reward-based fine-tuning without a differentiable pixel-space reward and without backpropagating through a large decoder, which keeps optimization memory-efficient and flexible across image, video, and text-conditioned rewards. Together, these components move toward near-real-time, high-quality T2V generation and offer a practical path to customization via reward metrics.

Abstract

Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model using 50-step DDIM sampling.

Paper Structure

This paper contains 64 sections, 25 equations, 29 figures, 18 tables, 2 algorithms.

Figures (29)

  • Figure 1: By incorporating variational score distillation, consistency distillation, and latent reward fine-tuning, our method generates high-quality videos with 4-step sampling, a $15.6\times$ acceleration over the teacher. For more visual examples, see the appendix and https://quantumiracle.github.io/dollar/.
  • Figure 2: Method Overview: The few-step generator $G_\theta$ is trained to generate high-quality samples from random noise in latent space, guided by a combination of variational score distillation (VSD), consistency distillation (CD), and latent reward model (LRM) fine-tuning objectives. The VSD loss enhances sample quality, albeit with a risk of mode collapse, while the CD loss increases sample diversity without compromising generation quality. The LRM enables reward-based optimization that further improves sample quality; it bypasses the large pixel-space reward model and the decoder, which reduces memory usage and removes the need for a differentiable reward model.
  • Figure 3: Demonstration of the conjugate velocity prediction: the relationship between $v$-prediction for diffusion and for rectified flow.
  • Figure 4: Visualization of samples from the training dataset (left) and samples generated after reward tuning with the HPSv2 reward (right).
  • Figure 5: Comparison of different reward fine-tuning methods: (1) Direct reward gradient methods are limited to small reward or video models or short input sequences, and they also require a differentiable reward model; (2) The latent reward model is compact and bypasses the decoder for gradient-based optimization, making it suitable when large reward models and decoders exceed available VRAM; (3) DDPO is similarly constrained by VRAM limits when handling large video models and tracking log-probabilities of samples over multiple steps.
  • ...and 24 more figures
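
The latent reward idea described in the captions for Figures 2 and 5 can be illustrated with a toy sketch. This is not the paper's architecture or training recipe. It assumes a hypothetical `decode` function standing in for an expensive decoder, a hypothetical non-differentiable `blackbox_reward`, a linear surrogate fitted directly on latents, and a "generator" that is only a learnable shift `theta` added to noise. The point it shows is that the black-box reward is only ever evaluated in forward passes, while the gradient used for fine-tuning comes from the small surrogate in latent space.

```python
import random

def decode(latent):
    # Hypothetical stand-in for an expensive latent-to-pixel decoder.
    return [2.0 * x for x in latent]

def blackbox_reward(pixels):
    # Hypothetical non-differentiable reward: fraction of values above a threshold.
    return sum(1.0 for p in pixels if p > 0.5) / len(pixels)

def fit_latent_reward(latents, rewards, lr=0.01, epochs=100):
    # Fit a linear surrogate r(z) ~ w . z + b on latents with plain SGD.
    dim = len(latents[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for z, r in zip(latents, rewards):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - r
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err
    return w, b

def generator(noise, theta):
    # Toy generator (assumption): a learnable shift applied to the noise.
    return [n + t for n, t in zip(noise, theta)]

def mean_true_reward(theta, noises):
    # Average black-box reward of decoded samples (forward passes only).
    total = sum(blackbox_reward(decode(generator(n, theta))) for n in noises)
    return total / len(noises)

random.seed(0)
DIM = 8
noises = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(256)]
theta = [0.0] * DIM

# 1) Label latents with the black-box reward; no gradients are needed here.
latents = [generator(n, theta) for n in noises]
rewards = [blackbox_reward(decode(z)) for z in latents]

# 2) Fit the compact latent surrogate.
w, b = fit_latent_reward(latents, rewards)

# 3) Gradient ascent on the surrogate reward with respect to theta.
#    For a linear surrogate, d(w . (n + theta) + b) / d(theta) = w.
before = mean_true_reward(theta, noises)
for _ in range(20):
    theta = [t + 0.5 * wi for t, wi in zip(theta, w)]
after = mean_true_reward(theta, noises)

print(f"mean black-box reward before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the surrogate is fitted once. A practical system would presumably keep refitting it as the generator's output distribution shifts, since a surrogate trained on the old latents becomes less reliable as `theta` moves away from them.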