Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences

Weijian Luo

TL;DR

Diff-Instruct++ (DI++) introduces an image-data-free framework for aligning one-step text-to-image generators with human preferences. It maximizes an expected reward while penalizing an Integral KL divergence to a reference diffusion process. The method connects diffusion-distillation theory with RLHF, shows that classifier-free guidance (CFG) implicitly performs RLHF, and provides a practical training loop that uses a TA diffusion model. Empirically, DI++-aligned DiT-based one-step models reach a leading human-preference score (HPSv2.0 of 28.48) together with strong Image Reward and aesthetic metrics, while converging quickly and requiring no image data. The aligned models compare favorably with several open-source and few-step baselines, though the paper also reports failure cases, such as occasional weak control and artifacts at higher CFG and reward scales.

Abstract

One-step text-to-image generator models offer advantages such as swift inference efficiency, flexible architectures, and state-of-the-art generation performance. In this paper, we study the problem of aligning one-step generator models with human preferences for the first time. Inspired by the success of reinforcement learning from human feedback (RLHF), we formulate the alignment problem as maximizing expected human reward functions while adding an Integral Kullback-Leibler divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first fast-converging and image data-free human preference alignment method for one-step text-to-image generators. We also introduce novel theoretical insights, showing that using CFG for diffusion distillation is secretly doing RLHF with DI++. This finding improves the understanding of CFG and may contribute to future research involving it. In the experiment sections, we align both UNet-based and DiT-based one-step generators using DI++, with Stable Diffusion 1.5 and PixelArt-$\alpha$ as the respective reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming other open-source models such as Stable Diffusion XL, DMD2, SD-Turbo, as well as PixelArt-$\alpha$. Both theoretical contributions and empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models. The homepage of the paper is https://github.com/pkulwj1994/diff_instruct_pp.
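
The alignment objective described in the abstract (expected reward plus an Integral KL penalty) can be written schematically as below. This is a sketch rather than the paper's exact statement: the symbols $g_\theta$ (one-step generator), $r$ (reward model), $\alpha$ (reward scale), $w(t)$ (time weighting), and $p_{\theta,t}$, $p_{\mathrm{ref},t}$ (the generator's and the reference diffusion's marginals at diffusion time $t$) are notation assumed here for illustration.

```latex
% Assumed notation: minimize over the generator parameters \theta.
\mathcal{L}(\theta)
  = \mathbb{E}_{c,\; z,\; x_0 = g_\theta(z, c)} \big[ -\alpha\, r(x_0, c) \big]
  + \int_0^T w(t)\, D_{\mathrm{KL}}\!\big( p_{\theta,t}(\cdot \mid c) \,\big\|\, p_{\mathrm{ref},t}(\cdot \mid c) \big)\, \mathrm{d}t
```

Under this reading, the first term pushes generated images toward higher reward, while the integral term keeps the generator's noised distributions close to those of the reference diffusion process at every noise level.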

Paper Structure

This paper contains 57 sections, 4 theorems, 40 equations, 5 figures, 3 tables, 3 algorithms.

Key Result

Theorem 3.1

The $\theta$ gradient of the objective \eqref{eqn:gen_rlhf} is

Figures (5)

  • Figure 1: Images generated by a one-step text-to-image generator that has been aligned with human preferences using Diff-Instruct++. The prompts are listed in Appendix \ref{app:prompts_demo}.
  • Figure 2: A demonstration of three stages for training a one-step text-to-image generator model that is aligned with human preference. The pre-training stage (the leftmost column) pre-trains the reference diffusion model as well as the one-step generator. The reward modeling stage (the middle column) trains the reward model using human preference data. The alignment stage (the rightmost column) uses a pre-trained reference diffusion model, the reward model, and a TA diffusion model to align the one-step generator with human preference.
  • Figure 3: A qualitative comparison of one-step generator models aligned using Diff-Instruct++ with different configurations. The bottom row is the weakest setting, with no human preference alignment. The upper rows show progressively stronger alignment. The prompts used to generate the images are listed in Appendix \ref{app:qualitative_prompt}. The generated images become progressively more aesthetic as the alignment strength increases.
  • Figure 4: Qualitative comparison of our Diff-Instruct++ aligned models against other few-step text-to-image models in Table \ref{quantitative}. The left three columns are placed in random order: one is generated by the PixelArt-$\alpha$ model with 30 steps, one by a one-step model aligned with Diff-Instruct++ using a CFG scale of 4.5 and a reward scale of 1.0, and one by a one-step model aligned using a CFG scale of 4.5 and a reward scale of 10.0. Please zoom in to check details, lighting, and aesthetic performance. Could you please tell us which one you like the best? The answer for each image and the prompts for the three rows are given in Appendix \ref{app:prompts_fewstep}.
  • Figure 5: Failure cases of the aligned one-step generator model (CFG scale 4.5, reward scale 1.0).
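
The captions above quote CFG scales such as 4.5. As a reminder of what that number controls, here is a minimal NumPy sketch of the standard classifier-free guidance combination of conditional and unconditional noise predictions. The function name `cfg_combine`, the toy arrays, and the convention that a scale of 1.0 recovers the purely conditional prediction are assumptions made for illustration; this is not code from the paper.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance (one common convention, assumed here).

    Starts from the unconditional prediction and moves `scale` times the
    difference toward the conditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy stand-ins for a diffusion model's noise predictions (hypothetical values).
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

guided = cfg_combine(eps_uncond, eps_cond, 4.5)
print(guided)
```

With this convention, a scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger scales extrapolate past it.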

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3
  • Theorem 3.4
  • Remark 3.5
  • Lemma B.1: Pseudo Loss Function