RPO: Fine-Tuning Visual Generative Models via Rich Vision-Language Preferences

Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang

TL;DR

Traditional reward-model-based preference tuning for visual generators often lacks transparency and is prone to reward hacking. Rich Preference Optimization (RPO) leverages rich critiques from Vision-Language Models to curate informative synthetic preference pairs and to extract actionable editing instructions, which are applied via an instruction-following image editor and ControlNet. The edited images are relabeled and used to fine-tune diffusion models with Diffusion-DPO, yielding significant data-efficiency gains and improved alignment across metrics and prompts. The approach demonstrates robust generalization to unseen prompts and styles, and it bridges Vision-Language critique with practical image-editing guidance for post-training alignment of visual generators.

Abstract

Traditional preference tuning methods for LLMs/Visual Generative Models often rely solely on reward model labeling, which can be opaque, offers limited insight into the rationale behind preferences, and is prone to issues such as reward hacking or overfitting. We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals from Vision Language Models (VLMs) to improve the curation of preference pairs for fine-tuning visual generative models such as text-to-image diffusion models. Our approach begins by prompting VLMs to generate detailed critiques of synthesized images, from which we further prompt VLMs to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.
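The curation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four callables (`critique`, `extract_instruction`, `edit_image`, `reward`) are hypothetical stand-ins for the critic VLM, the instruction-extracting VLM, the image editor, and a reward model, and the rule of labeling the higher-scoring image as preferred and dropping ties is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    preferred: Any      # image handle (path, tensor, PIL image, ...)
    dispreferred: Any


def curate_pairs(
    prompt_image_pairs: Iterable[Tuple[str, Any]],
    critique: Callable[[str, Any], str],                  # hypothetical critic VLM
    extract_instruction: Callable[[str, Any, str], str],  # hypothetical instruction VLM
    edit_image: Callable[[Any, str], Any],                # hypothetical image editor
    reward: Callable[[str, Any], float],                  # hypothetical reward model
) -> List[PreferencePair]:
    """Build synthetic preference pairs from (prompt, generated image) samples."""
    pairs: List[PreferencePair] = []
    for prompt, image in prompt_image_pairs:
        # 1. Ask the critic for a detailed textual critique of the image.
        critique_text = critique(prompt, image)
        # 2. Turn the critique into an actionable editing instruction.
        instruction = extract_instruction(prompt, image, critique_text)
        # 3. Apply the instruction to obtain a refined image.
        edited = edit_image(image, instruction)
        # 4. Relabel with the reward model (assumption: higher score wins,
        #    ties carry no preference signal and are dropped).
        r_orig, r_edit = reward(prompt, image), reward(prompt, edited)
        if r_edit == r_orig:
            continue
        if r_edit > r_orig:
            pairs.append(PreferencePair(prompt, edited, image))
        else:
            pairs.append(PreferencePair(prompt, image, edited))
    return pairs
```

The returned pairs would then serve as the tuning dataset for a preference-optimization objective such as Diffusion-DPO. Passing the models in as callables keeps the sketch independent of any particular VLM or editor API.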

Paper Structure

This paper contains 28 sections, 2 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Our RPO pipeline for curating informative preference pairs from images generated by the base diffusion model: (1) rich feedback (critique) generation by a VLM (for which we choose LLaVA-Critic-7B), (2) actionable editing instruction generation from the critiques by another VLM (for which we choose Qwen2.5-VL-8B-Instruct), (3) instruction-following image editing driven by the generated editing instructions (for which we choose ControlNet), and (4) Diffusion-DPO training on preference pairs filtered by a reward model.
  • Figure 2: Comparison of RichFB-generated informative feedback and our adopted textual criticism, generated by carefully prompting a capable VLM.
  • Figure 3: Comparisons of feedback approaches and VLM performance for enhanced image editing, evaluated by ImageReward, HPSv2 and PickScore.
  • Figure 4: Model performance evaluated by PickScore, ImageReward, Aesthetic, and HPSv2.
  • Figure 5: RPO improves DPO performance across data scales, prompt sets, and evaluation metrics. The x-axis indicates the number of Pick-a-Pic training samples used for fine-tuning. The first row presents in-distribution results on Pick-a-Pic, while the second row shows out-of-distribution performance on the PartiPrompts prompt set. In all plots, stars indicate the performance of the full Diffusion-DPO model, based on the publicly released checkpoints from Wallace et al. (2023), which were trained on nearly 1M offline preference pairs.
  • ...and 8 more figures