HuViDPO: Enhancing Video Generation through Direct Preference Optimization for Human-Centric Alignment

Lifan Jiang, Boxi Wu, Jiahui Zhang, Xiaotong Guan, Shuang Chen

TL;DR

This paper tackles the lack of a principled objective for aligning text-to-video generation with human preferences and the data scarcity hindering such alignment. It introduces HuViDPO, the first approach to apply Direct Preference Optimization to T2V by deriving a video-specific loss $\mathcal{L}_{\text{Video}}(\theta)$ and enabling preference-guided fine-tuning without a separate reward model. The method combines a two-stage, LoRA-based fine-tuning on small, action-specific datasets, a First-Frame-Conditioned generation strategy using DPO-SDXL, and an enhanced SparseCausal-Attention module to boost spatiotemporal consistency and diversity. Empirical results across eight action categories show improved aesthetics, alignment with human preferences, and temporal coherence compared with baselines, with efficient training on a single 24G GPU and accessible deployment.

Abstract

With the rapid development of AIGC technology, significant progress has been made in diffusion model-based technologies for text-to-image (T2I) and text-to-video (T2V). In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly improving the alignment of generated images with human preferences. However, existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally, challenges such as the scarcity of paired video preference data hinder effective model training. At the same time, the lack of training datasets risks insufficient flexibility and poor quality in the generated videos. To address these problems, our work proposes three targeted solutions in sequence. 1) Our work is the first to introduce the DPO strategy into T2V tasks. By deriving a carefully structured loss function, we utilize human feedback to align video generation with human preferences. We refer to this new method as HuViDPO. 2) Our work constructs small-scale human preference datasets for each action category and fine-tunes the model on them, improving the aesthetic quality of the generated videos while reducing training costs. 3) We adopt a First-Frame-Conditioned strategy, leveraging the rich information from the first frame to guide the generation of subsequent frames, enhancing flexibility in video generation. At the same time, we employ a SparseCausal Attention mechanism to enhance the quality of the generated videos. More details and examples can be accessed on our website: https://tankowa.github.io/HuViDPO.github.io/.
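
As an illustration of how a reward-model-free preference objective of this kind can be computed, the sketch below combines the four per-video loss values named in the Figure 2 caption ($loss_w$, $loss_l$, $loss_{wref}$, $loss_{lref}$) in the general style of DPO for diffusion models. It is a minimal sketch under assumptions, not the paper's exact $\mathcal{L}_{\text{Video}}(\theta)$: the function name `dpo_video_loss`, the default `beta`, and the absence of any timestep weighting are hypothetical choices made for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_video_loss(loss_w: float, loss_l: float,
                   loss_wref: float, loss_lref: float,
                   beta: float = 1.0) -> float:
    """DPO-style preference loss for one (winning, losing) video pair.

    loss_w / loss_l:       denoising losses of the fine-tuned model on the
                           winning / losing video.
    loss_wref / loss_lref: the same losses under the frozen reference model.
    beta:                  preference strength (hypothetical default).
    """
    # How much the fine-tuned model changed each loss relative to the reference.
    w_diff = loss_w - loss_wref
    l_diff = loss_l - loss_lref
    # The margin is positive when the model improved more on the winning video
    # than on the losing one.
    margin = -beta * (w_diff - l_diff)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Model identical to the reference: the loss equals log(2).
    print(dpo_video_loss(0.5, 0.5, 0.5, 0.5))
    # Model fits the winning video better and the losing video worse.
    print(dpo_video_loss(0.1, 0.9, 0.5, 0.5))
```

Because only differences against the reference model enter the objective, no separate reward model is needed: the frozen reference supplies the baseline, and the loss falls below $\log 2$ only when the fine-tuned model favors the winning video more than the reference does.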

Paper Structure

This paper contains 23 sections, 25 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Videos generated by our HuViDPO. Videos generated by HuViDPO show improved flexibility, quality, and better alignment with human preferences. Green highlights indicate the action category, blue highlights represent motion-related information, and orange highlights denote the background. You can see more examples in the supplementary material. More details and examples can be accessed on our website: https://tankowa.github.io/HuViDPO.github.io/.
  • Figure 2: Training pipeline of our HuViDPO. The training process is divided into two stages: (a) Training the Attention Block and Temporal-Spatial layers using basic training settings to improve spatiotemporal consistency. (b) Fine-tuning the model, with LoRA added and other layers frozen, using small-scale human preference datasets and the DPO strategy to enhance its alignment with human preferences. In stage (b), $loss_w$ and $loss_l$ denote the loss values computed by inputting winning and losing videos into the fine-tuned model, while $loss_{wref}$ and $loss_{lref}$ are the loss values obtained by inputting the same videos into the reference model.
  • Figure 3: The details of the proposed SparseCausal-Attention Mechanism. We extract $K/V$ tokens from the first frame and frame $i-1$, and compute attention against the $Q$ of frame $i$.
  • Figure 4: Inference Process of our HuViDPO. We first use DPO-SDXL to generate a stylistically diverse first frame, then concatenate it with noise frames and feed the result into the trained model to produce the output video.
  • Figure 5: Qualitative comparison with LVDM (He et al., 2022), AnimateDiff (Guo et al., 2023), and LAMP (Wu et al., 2023). The images show that the videos generated by our method exhibit better spatiotemporal consistency and are more visually aligned with human preferences. In example (a), the video generated by our method exhibits richer content and stronger spatiotemporal consistency: the rainbow and the shape of the bridge remain largely unchanged across frames. In example (b), the video generated by our method features a more reasonable layout and richer main subjects, with more detailed character features and a softer background.
  • ...and 8 more figures
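
The attention pattern described in the Figure 3 caption (queries from frame $i$, keys and values from the first frame and frame $i-1$) can be sketched in NumPy as follows. This is a single-head illustration under assumptions, not the paper's enhanced module: the function name `sparse_causal_attention`, the array layout `(frames, tokens, dim)`, and the choice to let frame 0 reuse itself as its own "previous" frame are all hypothetical.

```python
import numpy as np


def sparse_causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single-head sparse causal attention over video frames.

    q, k, v have shape (frames, tokens, dim). Frame i attends only to the
    tokens of frame 0 and frame i-1, never to later frames.
    """
    num_frames, _, dim = q.shape
    out = np.empty_like(v, dtype=float)
    for i in range(num_frames):
        # Frame 0 has no predecessor; it reuses itself (assumption).
        prev = max(i - 1, 0)
        # Keys/values: first-frame tokens followed by previous-frame tokens.
        keys = np.concatenate([k[0], k[prev]], axis=0)      # (2*tokens, dim)
        values = np.concatenate([v[0], v[prev]], axis=0)    # (2*tokens, dim)
        # Scaled dot-product scores, then a numerically stable softmax.
        scores = q[i] @ keys.T / np.sqrt(dim)               # (tokens, 2*tokens)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out[i] = weights @ values
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((4, 3, 8)) for _ in range(3))
    print(sparse_causal_attention(q, k, v).shape)
```

Each frame attends to a fixed number of tokens (two frames' worth) regardless of video length, so the cost grows linearly with the number of frames, while the shared first frame acts as a common anchor for appearance.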