RPO: Fine-Tuning Visual Generative Models via Rich Vision-Language Preferences
Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
TL;DR
Traditional reward-model-based preference tuning for visual generators often lacks transparency and is prone to reward hacking. Rich Preference Optimization (RPO) leverages rich critiques from Vision-Language Models to curate informative synthetic preference pairs and to extract actionable editing instructions, which are applied via an instruction-following image editor and ControlNet. The edited images are relabeled and used to fine-tune diffusion models with Diffusion-DPO, yielding significant data-efficiency gains and improved alignment across metrics and prompts. The approach demonstrates robust generalization to unseen prompts and styles, and it bridges Vision-Language critique with practical image-editing guidance for post-training alignment of visual generators.
Abstract
Traditional preference tuning methods for LLMs and visual generative models often rely solely on reward model labeling, which can be opaque, offers limited insight into the rationale behind preferences, and is prone to issues such as reward hacking or overfitting. We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals from Vision Language Models (VLMs) to improve the curation of preference pairs for fine-tuning visual generative models such as text-to-image diffusion models. Our approach begins by prompting VLMs to generate detailed critiques of synthesized images, from which we further prompt VLMs to extract reliable and actionable image editing instructions. By applying these instructions, we create refined images, which yield synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.
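The data-curation steps described in the abstract (critique, instruction extraction, editing, pairing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PreferencePair`, `build_preference_pairs`, and the callables `generate`, `critique`, `extract_instructions`, `edit`, and `score` are all hypothetical placeholders for the diffusion model, the VLM prompts, the instruction-following editor, and an optional relabeling step. Without a `score` function, the sketch simply assumes the edited image is the preferred one.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class PreferencePair:
    """One synthetic preference pair plus the feedback that produced it."""
    prompt: str
    preferred: Any
    rejected: Any
    critique: str
    instructions: List[str]


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], Any],                     # hypothetical: text-to-image model
    critique: Callable[[str, Any], str],                # hypothetical: VLM critique prompt
    extract_instructions: Callable[[str], List[str]],   # hypothetical: VLM extraction prompt
    edit: Callable[[Any, str], Any],                    # hypothetical: image editor
    score: Optional[Callable[[str, Any], float]] = None,  # hypothetical: relabeling scorer
) -> List[PreferencePair]:
    pairs = []
    for prompt in prompts:
        image = generate(prompt)
        feedback = critique(prompt, image)
        instructions = extract_instructions(feedback)
        if not instructions:
            # No actionable edit was found, so no pair is created for this prompt.
            continue

        # Apply the editing instructions one after another.
        refined = image
        for instruction in instructions:
            refined = edit(refined, instruction)

        # Assumption: the edited image wins unless a scorer says otherwise.
        preferred, rejected = refined, image
        if score is not None and score(prompt, refined) < score(prompt, image):
            preferred, rejected = image, refined

        pairs.append(PreferencePair(prompt, preferred, rejected, feedback, instructions))
    return pairs
```

The resulting list of pairs is the kind of dataset a preference-based fine-tuning objective such as Diffusion-DPO consumes; passing the model calls in as plain callables keeps the sketch independent of any particular VLM or editor API.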
