Curriculum Direct Preference Optimization for Diffusion and Consistency Models

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

TL;DR

This work tackles the problem of aligning text-to-image generation with human preferences by introducing Curriculum Direct Preference Optimization (Curriculum DPO) for diffusion and consistency models. It combines reward-model ranking per prompt with a curriculum that gradually increases pair difficulty, then fine-tunes via Direct Preference Optimization, and it additionally adapts DPO to consistency models (Consistency-DPO). Across nine benchmarks and with two base models (Stable Diffusion and Latent Consistency Model), Curriculum DPO outperforms state-of-the-art methods in text alignment, aesthetics, and human preference, with human studies validating the improvements and notable data efficiency. The approach offers a practical, scalable path to more human-aligned image synthesis, reducing data requirements while improving output quality and alignment with nuanced preferences.

Abstract

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. Our method is divided into two training stages. First, a ranking of the examples generated for each prompt is obtained by employing a reward model. Then, increasingly difficult pairs of examples are sampled and provided to a text-to-image generative (diffusion or consistency) model. Generated samples that are far apart in the ranking are considered to form easy pairs, while those that are close in the ranking form hard pairs. In other words, we use the rank difference between samples as a measure of difficulty. The sampled pairs are split into batches according to their difficulty levels, which are gradually used to train the generative model. Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
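
The pair-construction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `rank_samples` and `build_curriculum`, the equal-width mapping from rank difference to difficulty level, and the toy reward values are all assumptions made for illustration.

```python
from itertools import combinations


def rank_samples(rewards):
    """Return sample indices ordered from highest to lowest reward."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)


def build_curriculum(rewards, num_batches):
    """Group (preferred, rejected) index pairs for one prompt into batches.

    Batch 0 holds the easiest pairs (largest rank difference) and the last
    batch holds the hardest pairs (smallest rank difference).
    The binning rule below is an assumption, not taken from the paper.
    """
    ranking = rank_samples(rewards)
    max_gap = len(ranking) - 1
    batches = [[] for _ in range(num_batches)]
    for pos_w, pos_l in combinations(range(len(ranking)), 2):
        gap = pos_l - pos_w  # rank difference, between 1 and max_gap
        # Large gap -> low level (easy); small gap -> high level (hard).
        level = (max_gap - gap) * num_batches // max_gap
        batches[level].append((ranking[pos_w], ranking[pos_l]))
    return batches


# Toy reward scores for six images generated from a single prompt.
rewards = [0.1, 0.9, 0.5, 0.7, 0.3, 0.2]
batches = build_curriculum(rewards, num_batches=3)
for level, pairs in enumerate(batches):
    print(f"difficulty level {level}: {pairs}")
```

In a full pipeline, the batches would be consumed in order, with the generative model fine-tuned via DPO on each batch before moving to the next, harder one. The figure list refers to the number of training iterations per batch as $K$ and the number of batches as $B$.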

Paper Structure (18 sections, 21 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: An overview of Curriculum DPO. Images generated by a diffusion / consistency model and their prompts are passed through a reward model, obtaining a preference ranking for each prompt. Next, image pairs of various difficulty levels are generated and organized into batches, such that the initial batch contains easy pairs (with high difference in terms of preference scores) and subsequent batches contain increasingly difficult pairs (the difference in terms of preference is gradually decreased). The diffusion / consistency model is finally trained via Direct Preference Optimization (DPO) based on curriculum learning. Best viewed in color.
  • Figure 2: Qualitative results on dataset $D_1$, before and after fine-tuning for the text alignment task. The fine-tuning alternatives are: DDPO, DPO and Curriculum DPO. Best viewed in color.
  • Figure 3: Ablation results obtained by varying the hyperparameter $\beta$ for Consistency-DPO (in blue), the number of training iterations per batch $K$ for Curriculum DPO (in blue), the number of batches $B$ for Curriculum DPO (in blue), and the number of training images per prompt $M$ (in blue). Fine-tuned LCM models are compared with the pre-trained LCM baseline on the human preference and visual appeal tasks (where scores are given by the HPSv2 reward model and LAION Aesthetics Predictor respectively).
  • Figure 4: Qualitative results before and after fine-tuning for the text alignment task on DrawBench. The fine-tuning methods are: DDPO, DPO, Naive DPO and Curriculum DPO. Best viewed in color.
  • Figure 5: Qualitative results after fine-tuning with HPSv2 as the reward model (human preference). The fine-tuning alternatives are: DDPO, DPO and Curriculum DPO. Best viewed in color.
  • ...and 3 more figures