Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

Kun Li, Jianhui Wang, Yangfan He, Xinyuan Song, Ruoyu Wang, Hongyang He, Wenxin Zhang, Jiaqi Chen, Keqin Li, Sida Li, Miao Zhang, Tianyu Shi, Xueqian Wang

TL;DR

This work tackles the challenge of aligning diffusion-based text-to-image outputs with fine-grained, evolving user preferences in multi-turn dialogues. It proposes Visual Co-Adaptation (VCA), a human-in-the-loop framework that combines a reward model trained on human preferences with LoRA-based fine-tuning of the diffusion model, guided by multi-turn prompt refinement and a three-objective reward design: diversity ($R_{div}$), consistency ($R_{cons}$), and mutual information ($R_{MI}$), aggregated as $R_{total}(t)$. The authors prove a conditional convergence result showing that the latent distribution $p(z_T)$ approaches the target as the number of rounds increases, and they show Pareto-optimal convergence under dynamic reward weighting. Empirically, the approach outperforms baselines in user satisfaction and consistency on a large multi-turn dialogue dataset. The work also provides an interactive tool that lets non-experts generate personalized, high-quality images.
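To make the aggregation into $R_{total}(t)$ concrete, the following is a minimal sketch that assumes the total reward is a convex combination of the three component rewards with round-dependent weights. The weight schedule, the function names `dynamic_weights` and `total_reward`, and the 0-indexed round counter are all illustrative assumptions; the summary does not state the paper's actual weighting rule.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weights(t, num_rounds):
    """Hypothetical schedule (an assumption, not the paper's rule):
    favor diversity in early rounds and consistency in late rounds."""
    progress = t / max(num_rounds - 1, 1)  # 0.0 at the first round, 1.0 at the last
    raw = [1.0 - progress, progress, 0.5]  # raw scores for (div, cons, MI)
    return softmax(raw)


def total_reward(r_div, r_cons, r_mi, t, num_rounds):
    """Weighted sum of the three component rewards at round t (0-indexed)."""
    w_div, w_cons, w_mi = dynamic_weights(t, num_rounds)
    return w_div * r_div + w_cons * r_cons + w_mi * r_mi


if __name__ == "__main__":
    # Placeholder reward values, chosen only to exercise the function.
    for t in range(4):
        print(t, round(total_reward(0.8, 0.4, 0.6, t, num_rounds=4), 4))
```

Because the weights always sum to 1, the total stays on the same scale as the component rewards, so only their relative emphasis changes from round to round.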

Abstract

Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
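For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it relies on: the pretrained weight $W$ stays frozen and only two small matrices $A$ and $B$ are trained, giving an effective weight $W + \frac{\alpha}{r} BA$. This is a generic pure-Python illustration of that formula, not the authors' training code; the helper names, the initialization scale, and the matrix sizes are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A (rank x d_in) small random, B (d_out x rank) zeros.
    With B = 0 the adapted layer initially matches the frozen layer."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 base weight
    a, b = init_lora(d_out=2, d_in=3, rank=1)
    print(lora_effective_weight(w, a, b, alpha=8.0) == w)  # True before any training
```

Only `a` and `b` would receive gradient updates during fine-tuning, which is why the number of trainable parameters scales with the rank rather than with the full weight matrix.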

Paper Structure

This paper contains 18 sections, 5 theorems, 86 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Given a user feedback sequence $\{\nabla_{\text{feedback}}^{(t)}\}_{t=1}^T$ that generates prompt sequences $\{P_t\}_{t=1}^T$ via the language model $\mathcal{F}_{\text{LLM}}$, assume there exists an ideal prompt $P_{\text{target}}$ such that $\psi(P_{\text{target}})$ perfectly aligns with user intent ... where $\text{DM}^{(t)}$ is the diffusion model at round $t$, and $\psi(\cdot)$ is the prompt embedding ...

Figures (10)

  • Figure 1: The workflow demonstrates how human preferences guide text-to-image diffusion, with a DPO-trained reward model evaluating image-prompt alignment and PPO updating LoRA parameters while keeping the diffusion model fixed.
  • Figure 2: Overview of our multi-round dialogue generation process. (a) shows how prompts and feedback refine images over rounds. (b) compares multi-round user correction with single-round self-correction. (c) illustrates the diffusion process with LoRA layers and text embeddings. The total reward $R_{\text{total}}$ balances diversity, consistency, and mutual information across rounds.
  • Figure 3: Weight changes for the different reward components.
  • Figure 4: Comparison of Preference and CLIP scores across different models.
  • Figure 5: Win rates between all methods.
  • ...and 5 more figures

Theorems & Definitions (13)

  • Theorem 3.1: Conditional Convergence of Multi-Round Diffusion Process
  • Theorem 3.2: Global Optimality of Dynamic Reward Optimization
  • Theorem A.4: Conditional Convergence of Multi-Round Diffusion Process (restatement of Theorem 3.1)
  • proof
  • Definition A.6: Dynamically Weighted Total Reward Function
  • Definition A.7: Diversity Reward
  • Definition A.8: Consistency Reward
  • Definition A.9: Mutual Information Reward
  • Theorem A.10: Global Optimality of Dynamic Reward Optimization (restatement of Theorem 3.2)
  • proof
  • ...and 3 more