IPO: Iterative Preference Optimization for Text-to-Video Generation

Xiaomeng Yang, Zhiyu Tan, Hao Li

TL;DR

This work introduces Iterative Preference Optimization (IPO), a post-training framework that aligns text-to-video generation with human preferences. IPO trains a critic model on a human-annotated preference dataset so that generated videos can be labeled automatically, which enables multi-round optimization with diffusion-based DPO or KTO objectives while incorporating real-video data for regularization. The approach improves subject consistency, motion smoothness, and aesthetic quality, and a 2B-parameter model trained with IPO surpasses a 5B baseline on VBench. By reducing manual labeling and enabling iterative refinement, IPO offers a practical path to high-quality, human-aligned video generation and sets a new state of the art on the VBench benchmark.

Abstract

Video foundation models have advanced significantly through network upgrades and model scale-up. However, their generation quality still falls short of application requirements. To address this, we propose aligning video foundation models with human preferences through post-training. We introduce an Iterative Preference Optimization (IPO) strategy that enhances generated video quality by incorporating human feedback. Specifically, IPO uses a critic model to judge generated videos, either by pairwise ranking as in Direct Preference Optimization (DPO) or by point-wise scoring as in Kahneman-Tversky Optimization (KTO). IPO then optimizes video foundation models under the guidance of these preference signals, which improves generated video quality in subject consistency, motion smoothness, aesthetic quality, and other aspects. In addition, IPO builds the critic model on a multimodal large language model, which enables it to assign preference labels automatically without retraining or relabeling. In this way, IPO can efficiently perform multi-round preference optimization iteratively, without tedious manual labeling. Comprehensive experiments demonstrate that IPO effectively improves the video generation quality of a pretrained model and helps a model with only 2B parameters surpass one with 5B parameters. IPO also achieves new state-of-the-art performance on the VBench benchmark.
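
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `rank_pairs`, `dpo_loss`, `iterative_rounds`, `generate`, `critic`, and `update` are hypothetical, the best-versus-worst pairing rule is an arbitrary choice, and the loss shown is the generic log-likelihood form of DPO rather than the diffusion-specific objective the paper uses.

```python
import math
from typing import Callable, List, Tuple


def rank_pairs(videos: List[str], critic: Callable[[str], float]) -> List[Tuple[str, str]]:
    """Turn critic scores into (preferred, rejected) pairs (hypothetical pairing rule)."""
    scored = sorted(videos, key=critic, reverse=True)
    pairs = []
    # Pair the best-scored sample with the worst, the second best with the second worst, ...
    for i in range(len(scored) // 2):
        winner, loser = scored[i], scored[-1 - i]
        if critic(winner) > critic(loser):  # skip ties, which carry no preference signal
            pairs.append((winner, loser))
    return pairs


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Generic DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def iterative_rounds(prompts: List[str],
                     generate: Callable[[str], str],
                     critic: Callable[[str], float],
                     update: Callable[[list], None],
                     rounds: int = 3,
                     samples_per_prompt: int = 4) -> None:
    """Each round: sample videos, label them with the critic, then update the generator."""
    for _ in range(rounds):
        batch = []
        for prompt in prompts:
            candidates = [generate(prompt) for _ in range(samples_per_prompt)]
            batch.extend((prompt, w, l) for w, l in rank_pairs(candidates, critic))
        update(batch)  # assumed to run one preference-optimization pass on the batch
```

Because the critic, not a human annotator, produces the labels inside the loop, each extra round costs only generation and scoring time. For point-wise (KTO-style) training, the critic score would instead be thresholded into a per-sample desirable/undesirable label.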

Paper Structure

This paper contains 42 sections, 12 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of (a) traditional single-round preference optimization and (b) the proposed Iterative Preference Optimization for aligning a video foundation model with human feedback. IPO introduces a critic model to automatically label video data without tedious manual effort, making efficient iterative updates possible.
  • Figure 2: Overview of our proposed Iterative Preference Optimization framework. IPO consists of three major parts: (a) Human Preference Dataset. It is used to train a critic model. This dataset requires only one round of manual annotation, with no relabeling needed in the iterative framework; (b) Critic Model. It learns to automatically annotate generated videos with preference labels, which eliminates tedious manual effort in multi-round optimization; (c) Iterative Optimization. It uses the critic model to iteratively optimize the video foundation model. In this way, IPO can efficiently tune the base model to align with human preferences and enhance its generation ability in subject consistency, motion smoothness, aesthetic quality, and other aspects. Best viewed in 2x zoom.
  • Figure 3: Distribution statistics of the Human Preference Dataset: prompts in (a) and (b), and scoring/ranking in (c) and (d).
  • Figure 4: Qualitative comparison of our proposed IPO model and the baseline CogVideoX-2B model. IPO improves on the baseline in prompt following (1st row), aesthetic quality (2nd row), subject consistency (3rd row), and motion smoothness (4th row). This demonstrates the effectiveness of IPO in enhancing the generation capability of video foundation models. Additional visualization results are available in the supplementary materials.
  • Figure 5: Human evaluation of IPO and CogVideoX-2B.
  • ...and 2 more figures