Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah

TL;DR

Curriculum-DPO++ addresses the misalignment between data difficulty and model capacity in preference-guided fine-tuning for diffusion and consistency models. It integrates data-level curricula with a model-level curriculum that expands capacity via progressive unfreezing of layers and increasing LoRA rank, and it also introduces a reward-model-free variant using prompt embedding masking. Across nine benchmarks, Curriculum-DPO++ delivers consistent gains in text alignment, aesthetics, and human preference over Curriculum-DPO and other baselines, demonstrating improved sample efficiency and robustness. The work has practical impact for more aligned text-to-image generation and suggests potential extensions to NLP and other modalities.

Abstract

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account that some preferences are harder to learn than others, which renders the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, since fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it reaches the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics, and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
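The model-level curriculum described in the abstract can be pictured as a schedule that maps each training stage to a number of unfrozen layers and a LoRA rank, both growing until they match the baseline configuration. The sketch below is only an illustration under stated assumptions: the linear interpolation, the number of stages, and the concrete layer and rank values are hypothetical and are not taken from the paper, and `CapacityStage` and `capacity_schedule` are names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityStage:
    trainable_layers: int  # number of layers that are unfrozen at this stage
    lora_rank: int         # rank of the low-rank matrices at this stage


def capacity_schedule(stage: int, num_stages: int,
                      min_layers: int, max_layers: int,
                      min_rank: int, max_rank: int) -> CapacityStage:
    """Interpolate capacity between a small start and the baseline.

    Assumption: growth is linear in the stage index. The actual schedule
    used by the authors may differ.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    # Fraction of the curriculum completed; the last stage reaches 1.0.
    frac = stage / (num_stages - 1) if num_stages > 1 else 1.0
    layers = min_layers + round(frac * (max_layers - min_layers))
    rank = min_rank + round(frac * (max_rank - min_rank))
    return CapacityStage(layers, rank)


# Hypothetical values: 4 stages, growing from 4 to 16 layers and rank 2 to 8.
stages = [capacity_schedule(s, 4, min_layers=4, max_layers=16,
                            min_rank=2, max_rank=8) for s in range(4)]
for s, cfg in enumerate(stages):
    print(s, cfg.trainable_layers, cfg.lora_rank)
```

In a training loop, one would consult such a schedule whenever a new curriculum stage begins, unfreeze the additional layers, and enlarge the low-rank matrices to the new rank. How the enlarged matrices are initialized is not specified in this summary, so it is left out of the sketch.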

Paper Structure

This paper contains 16 sections, 21 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: An overview of Curriculum-DPO++. Images generated by a diffusion / consistency model and their prompts are passed through a reward model, obtaining a preference ranking for each prompt. Next, image pairs of various difficulty levels are generated and organized into batches, such that the initial batch contains easy pairs (with a large difference in preference scores) and subsequent batches contain increasingly difficult pairs (the difference in preference scores is gradually decreased). The diffusion / consistency model is trained via Direct Preference Optimization (DPO) based on an easy-to-hard data-level curriculum. Curriculum-DPO++ combines data-level and model-level curricula, increasing the learning capacity of the model to accommodate the more complex examples gradually introduced during training. The learning capacity is expanded by unfreezing neural layers and increasing the rank of LoRA matrices. Best viewed in color.
  • Figure 2: Qualitative results on dataset $D_1$, before and after fine-tuning for the text alignment task. The fine-tuning alternatives are DDPO, DPO, Curriculum-DPO, and Curriculum-DPO++. Best viewed in color.
  • Figure 3: Preference win rates according to all three reward models. The comparison is conducted between each of the four fine-tuning strategies and the pre-trained SD baseline. For this comparison, we use images generated for the Pick-a-Pic ($D_3$) test set of prompts. Best viewed in color.
  • Figure 4: Preference win rates according to all three reward models. The comparison is conducted between each of the four fine-tuning strategies and the pre-trained LCM baseline. For this comparison, we use images generated for the Pick-a-Pic ($D_3$) test set of prompts. Best viewed in color.
  • Figure 5: Ablation results obtained by (a) varying the hyperparameter $\beta$ for Consistency-DPO (in blue), (b) the number of training iterations per batch $K$ for Curriculum-DPO (in blue), (c) the number of batches $B$ for Curriculum-DPO (in blue), and (d) the number of training images per prompt $M$ (in blue). Fine-tuned LCM models are compared with the pre-trained LCM baseline on the human preference and visual appeal tasks, where scores are given by the HPSv2 reward model and LAION Aesthetics Predictor, respectively. Best viewed in color.
  • ...and 7 more figures
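
The data-level curriculum summarized in the Figure 1 caption (rank images per prompt with a reward model, form preference pairs, and order them from easy to hard) can be sketched as follows. This is a minimal illustration, assuming that difficulty is measured by the raw gap between reward scores and that pairs are split into equally sized batches; the function name `build_curriculum`, the image identifiers, and the scores are hypothetical and do not come from the paper.

```python
from itertools import combinations


def build_curriculum(scores: dict[str, float],
                     num_batches: int) -> list[list[tuple[str, str]]]:
    """Return up to `num_batches` lists of (preferred, rejected) pairs.

    Batches are ordered easy to hard. Assumption: a larger reward gap
    means an easier pair.
    """
    pairs = []
    for a, b in combinations(scores, 2):
        if scores[a] == scores[b]:
            continue  # a tie carries no preference signal
        win, lose = (a, b) if scores[a] > scores[b] else (b, a)
        pairs.append((scores[win] - scores[lose], win, lose))
    if not pairs:
        return []
    # Largest score gap first: these are treated as the easiest pairs.
    pairs.sort(key=lambda p: -p[0])
    size = -(-len(pairs) // num_batches)  # ceiling division
    return [[(w, l) for _, w, l in pairs[i:i + size]]
            for i in range(0, len(pairs), size)]


# Hypothetical reward scores for four images generated from one prompt.
scores = {"img_a": 10.0, "img_b": 7.0, "img_c": 3.0, "img_d": 1.0}
batches = build_curriculum(scores, num_batches=3)
for level, batch in enumerate(batches):
    print(level, batch)
```

With these scores, the first batch holds the pairs with the largest gaps (`img_a` against `img_d` and `img_c`), and the last batch holds the closest pairs, which are the hardest to tell apart.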