Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding
Kun Li, Jianhui Wang, Yangfan He, Xinyuan Song, Ruoyu Wang, Hongyang He, Wenxin Zhang, Jiaqi Chen, Keqin Li, Sida Li, Miao Zhang, Tianyu Shi, Xueqian Wang
TL;DR
This work tackles the challenge of aligning diffusion-based text-to-image outputs with fine-grained, evolving user preferences in multi-turn dialogues. It proposes Visual Co-Adaptation (VCA), a human-in-the-loop framework that combines a reward model trained to reflect human preferences with LoRA-based fine-tuning of the diffusion model. Training is guided by multi-turn prompt refinement and a three-objective reward design, $R_{div}$, $R_{cons}$, and $R_{MI}$, aggregated as $R_{total}(t)$. The authors prove a conditional convergence result showing that the latent distribution $p(z_T)$ approaches the target as the number of rounds increases, and they demonstrate Pareto-optimal convergence under dynamic reward weighting. Empirically, their approach outperforms baselines in user satisfaction and consistency on a large multi-turn dialogue dataset. The work also provides an interactive tool that lets non-experts generate personalized, high-quality images, highlighting its practical impact for accessible, preference-driven image synthesis.
Abstract
Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
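As a rough illustration of how several per-round rewards might be combined into a single score such as $R_{total}(t)$, the sketch below uses a normalized weighted sum with round-dependent weights. The weighting form, the schedule, and the names `total_reward` and `round_weights` are assumptions made for illustration only; this summary does not specify the authors' actual aggregation rule or weight dynamics.

```python
from typing import Dict


def round_weights(t: int, num_rounds: int) -> Dict[str, float]:
    """Hypothetical schedule (an assumption, not the paper's rule):
    shift emphasis from diversity to consistency as rounds progress."""
    progress = t / max(num_rounds - 1, 1)  # 0.0 at the first round, 1.0 at the last
    return {
        "div": 1.0 - 0.5 * progress,   # diversity weight decays from 1.0 to 0.5
        "cons": 0.5 + 0.5 * progress,  # consistency weight grows from 0.5 to 1.0
        "mi": 1.0,                     # third reward weight held constant
    }


def total_reward(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-objective rewards as a weighted sum, with the weights
    normalized so that they sum to 1."""
    weight_sum = sum(weights[name] for name in rewards)
    if weight_sum <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[name] / weight_sum * value for name, value in rewards.items())


if __name__ == "__main__":
    # Illustrative reward values only; they are not results from the paper.
    rewards = {"div": 0.6, "cons": 0.8, "mi": 0.7}
    for t in range(3):
        print(t, round(total_reward(rewards, round_weights(t, num_rounds=3)), 4))
```

Normalizing the weights keeps the combined score on the same scale as the individual rewards, which makes scores comparable across rounds even as the schedule changes.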
