PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas

TL;DR

PrefGen tackles personalization in diffusion-based image generation by extracting user-specific representations from limited reference images with a multimodal language model trained on preference-oriented VQA. It separates stable identity cues (e_core) from context-dependent semantic preferences (e_sem) and augments these with an image anchor (e_img), then aligns the semantic embedding to the diffusion text space via a maximum mean discrepancy loss. The unified user representation is injected into the generator through a lightweight IP-Adapter cross-attention pathway, enabling faithful adherence to prompts while capturing individual aesthetics. Extensive experiments on synthetic and real-world data show PrefGen outperforms competitive baselines in both image quality and preference alignment, supported by human evaluations and robust ablations. The approach offers a scalable, data-efficient route to personalized multimodal generation with strong generalization capabilities.
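For orientation, the decoupled cross-attention used by IP-Adapter-style pathways can be written as below. This is a sketch of the general IP-Adapter formulation, not a confirmed statement of PrefGen's exact layer. The symbol $\mathbf{c}_u$ (the user tokens, assumed here to be formed by stacking $\hat{\mathbf{e}}_{\text{sem}}$, $\mathbf{e}_{\text{core}}$, and $\mathbf{e}_{\text{img}}$), the text tokens $\mathbf{c}_t$, and the scale $\lambda$ are notational assumptions that do not appear in this summary.

$$
\mathbf{Z} = \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}_t^{\top}}{\sqrt{d}}\right)\mathbf{V}_t + \lambda \, \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}_u^{\top}}{\sqrt{d}}\right)\mathbf{V}_u,
\qquad
\mathbf{K}_t = \mathbf{c}_t \mathbf{W}_k,\; \mathbf{V}_t = \mathbf{c}_t \mathbf{W}_v,\;
\mathbf{K}_u = \mathbf{c}_u \mathbf{W}'_k,\; \mathbf{V}_u = \mathbf{c}_u \mathbf{W}'_v
$$

Here $\mathbf{Q}$ comes from the diffusion model's latent features. In the usual IP-Adapter setup only the added projections $\mathbf{W}'_k$ and $\mathbf{W}'_v$ are trained, which is what keeps the pathway lightweight; setting $\lambda = 0$ recovers the original text-only attention.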

Abstract

Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure compatibility with diffusion text encoders, we design a maximum mean discrepancy-based alignment loss that bridges the modality gap while preserving multimodal structure. The resulting embeddings are used to condition the generator, enabling faithful adherence to both prompts and user preferences. Extensive experiments demonstrate that our method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
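To make the alignment objective concrete, here is a minimal sketch of a squared maximum mean discrepancy (MMD) estimate between two sets of embeddings. It assumes a Gaussian (RBF) kernel with a fixed bandwidth and the biased estimator. The abstract does not state the kernel, bandwidth, or estimator, so those choices, along with the function names `rbf_kernel` and `mmd_squared`, are illustrative assumptions.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of vectors.

    xs: e.g. projected semantic preference embeddings (hypothetical input)
    ys: e.g. text-encoder embeddings (hypothetical input)
    """
    def mean_kernel(a_set, b_set):
        # Average kernel value over all pairs drawn from the two sets.
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    user_embeddings = [[0.0, 0.0], [1.0, 0.0]]
    text_embeddings = [[5.0, 5.0], [6.0, 5.0]]
    # A larger value means the two embedding distributions are further apart.
    print(mmd_squared(user_embeddings, text_embeddings))
```

Because MMD compares distributions rather than matching individual pairs, minimizing it pulls the projected embeddings toward the text space as a set without forcing every embedding onto a specific text vector, which fits the abstract's goal of bridging the modality gap while preserving multimodal structure.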

Paper Structure

This paper contains 37 sections, 4 equations, 20 figures, 14 tables.

Figures (20)

  • Figure 1: Overview of our framework. Step 1: Fine-tune an MLLM on preference-oriented VQA. Step 2: Perform layer analysis to extract identity embedding $\mathbf{e}_{\text{core}}$ and semantic preference embedding $\mathbf{e}_{\text{sem}}$. Step 3: Align $\mathbf{e}_{\text{sem}}$ with the text encoder space using an MMD loss, producing $\hat{\mathbf{e}}_{\text{sem}}$. Step 4: Inject $\hat{\mathbf{e}}_{\text{sem}}, \mathbf{e}_{\text{core}}, \mathbf{e}_{\text{img}}$ into the base model via cross-attention for preference-conditioned generation.
  • Figure 2: Study comparing different pooling strategies and layer selections for the preference discrimination task. The results show that using embeddings $\mathbf{e}_{\text{sem}}$ from the top four layers with the last-token strategy achieves the best performance in like–dislike discrimination.
  • Figure 3: Qualitative comparison with different methods. Each row shows the user’s preference and outputs from different approaches. PrefGen consistently captures both stylistic and semantic aspects of user preference, while others often fail to balance preference alignment and prompt fidelity.
  • Figure 4: Application examples of PrefGen. (a) Product design: PrefGen integrates the color composition extracted from a user’s preference history into the design of a rabbit-shaped lamp, aligning the generated output with user-specific aesthetic demands. (b) Character design: Given visual attribute specifications, PrefGen generates characters with distinct background colors while preserving the desired attributes.
  • Figure 5: Evaluation by human experts.
  • ...and 15 more figures
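
The last-token, top-four-layer extraction described in the Figure 2 caption can be sketched as follows. The caption does not say how the four layers are combined, so averaging them is an assumption made here for illustration. The function name `last_token_embedding` and the nested-list layout of `hidden_states` (layers, then tokens, then features) are likewise hypothetical.

```python
def last_token_embedding(hidden_states, num_top_layers=4):
    """Pool an embedding from the last token of the top layers.

    hidden_states: list of layers, each a list of per-token feature vectors.
    Averaging across layers is an assumed combination rule.
    """
    if num_top_layers < 1 or num_top_layers > len(hidden_states):
        raise ValueError("num_top_layers must be between 1 and the layer count")

    top_layers = hidden_states[-num_top_layers:]
    # Last-token strategy: keep only the final position of each layer.
    last_tokens = [layer[-1] for layer in top_layers]
    dim = len(last_tokens[0])
    return [
        sum(vec[i] for vec in last_tokens) / len(last_tokens)
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Toy input: 6 layers, 3 tokens, 2 features; each vector is [layer, token].
    toy_states = [[[float(l), float(t)] for t in range(3)] for l in range(6)]
    print(last_token_embedding(toy_states))
```

With batched, padded inputs the final position may be a padding token, so a real implementation would need to select the last non-padded position of each sequence instead.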