
Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

Yi Gu, Zhendong Wang, Yueqin Yin, Yujia Xie, Mingyuan Zhou

TL;DR

This work presents Diffusion-RPO, a preference-learning method that aligns diffusion-based text-to-image models with human preferences by performing stepwise optimization and applying contrastive, multi-modal weights. It extends Relative Preference Optimization to diffusion models, leveraging offline data sampling and CLIP-based embeddings to compare both identical and related prompts, and introduces Style Alignment as a cost-effective evaluation metric. Empirical results on SD1.5 and SDXL show that Diffusion-RPO outperforms Diffusion-DPO and SFT on automated human-preference metrics and on style alignment, and ablations examine the effect of the embedding-distance temperature. The work also provides datasets for style alignment evaluation, enabling more interpretable and reproducible assessment of preference learning in diffusion-based T2I models.

Abstract

Aligning large language models with human preferences has emerged as a critical focus in language modeling research. Yet, integrating preference learning into Text-to-Image (T2I) generative models is still relatively uncharted territory. The Diffusion-DPO technique made initial strides by employing pairwise preference learning in diffusion models tailored for specific text prompts. We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively. This approach leverages both prompt-image pairs with identical prompts and those with semantically related content across various modalities. Furthermore, we have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs, low reproducibility, and limited interpretability prevalent in current evaluations of human preference alignment. Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0, achieving superior results in both automated evaluations of human preferences and style alignment. Our code is available at https://github.com/yigu1008/Diffusion-RPO
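
The idea of weighting related prompt-image pairs by embedding distance, with a temperature controlling how sharply the weights concentrate, can be illustrated with a small sketch. This is not the paper's exact formulation: the function names, the use of cosine distance, the softmax normalization, and the toy 3-dimensional vectors (standing in for CLIP embeddings) are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def contrastive_weights(anchor, candidates, temperature=0.5):
    """Softmax over negative distances: closer embeddings get larger weights.

    Assumption (hypothetical form): weight_j is proportional to
    exp(-distance(anchor, candidate_j) / temperature).
    """
    scores = [-cosine_distance(anchor, c) / temperature for c in candidates]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vectors standing in for embeddings of prompt-image pairs.
anchor = [1.0, 0.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0],  # same prompt as the anchor
    [0.8, 0.6, 0.0],  # semantically related prompt
    [0.0, 0.0, 1.0],  # unrelated prompt
]

sharp = contrastive_weights(anchor, candidates, temperature=0.1)
flat = contrastive_weights(anchor, candidates, temperature=10.0)
```

Under these assumptions, a small temperature puts most of the weight on the pair with the identical prompt, while a large temperature spreads the weight almost uniformly across all pairs, which is the trade-off a temperature ablation would probe.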

Paper Structure

This paper contains 32 sections, 27 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Diffusion-RPO represents a novel approach that aligns Text-to-Image models with human preferences by optimizing diffusion model sampling steps and applying contrastive weighting to similar prompt-image pairs. As demonstrated by the samples above, the Diffusion-RPO fine-tuned SDXL-1.0 model successfully generates images that closely align with human preferences. The list of prompts is provided in the Appendix.
  • Figure 2: Sample images from RPO-SDXL. The prompts used to generate the images are: "A whimsical candy maker in her enchanted workshop, surrounded by a cascade of multicolored candies falling like rain, wearing a bright, patchwork dress, her hair tinted with streaks of pink and blue, 8K, hyper-realistic, cinematic, post-production.", "A samurai cloaked in white with swords stands in a light beam of a dark cave, with a ruby red sorrow evident in the image.", "Victorian genre painting portrait of Royal Dano, an old west character in fantasy costume, against a red background.", "A colorful anime painting of a sugar glider with a hiphop graffiti theme, by several artists, currently trending on Artstation.", and "A typhoon in a tea cup, digital render".
  • Figure 3: Example images from the Pick-a-Pic, Van Gogh, Sketch, and Winter datasets.
  • Figure 4: Sample images from style-aligned Stable Diffusion models. The images are generated from the prompts: "Edelgard from Fire Emblem depicted in Artgerm's style." and "Portrait of Archduke Franz Ferdinand by Charlotte Grimm, depicting his detailed face."
  • Figure 5: Sample images from RPO-SDXL. The prompts used to generate the images are: "A charismatic chef in a bustling kitchen, his apron dusted with flour, smiling as he presents a beautifully prepared dish. 8K, hyper-realistic, cinematic, post-production.", "A rebellious teenage boy with spiked, vibrant red hair, riding a futuristic motorcycle through neon-lit city streets, headphones around his neck blaring electronic music. Best quality, fine details.", "An oil painting of an anthropomorphic fox overlooking a village in the moor.", "A fantasy-themed portrait of a female elf with golden hair and violet eyes, her attire shimmering with iridescent colors, set in an enchanted forest. 8K, best quality, fine details.", and "A graphic poster featuring an avocado and raspberry observing a burning world, inspired by old botanical illustrations, Matisse, Caravaggio, Basquiat, and Japanese art."
  • ...and 2 more figures