Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo

Abstract

Text-to-image generation has advanced rapidly, yet it still struggles to capture nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To make individual preferences more distinct and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
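
The abstract does not give the form of the dispersion loss, so the following is only an illustrative sketch of what "enforcing separation among user embeddings" could look like. It assumes one common repulsive form, the log of the mean of exp(-squared distance / tau) over distinct pairs; the function name `dispersion_loss`, the temperature `tau`, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def dispersion_loss(embeddings, tau=1.0):
    """Hypothetical dispersion loss (assumed form, not the paper's).

    Returns log(mean(exp(-||e_i - e_j||^2 / tau))) over all distinct pairs.
    The value is 0 when all embeddings coincide and decreases as they spread apart.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings")
    terms = []
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between user i and user j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            terms.append(math.exp(-sq_dist / tau))
    return math.log(sum(terms) / len(terms))


# Toy 2-D user embeddings: one tightly clustered set, one well separated set.
clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]

print(dispersion_loss(clustered))
print(dispersion_loss(spread))
```

Minimizing a term of this kind alongside the main training objective would push user embeddings apart, since the separated set scores lower than the clustered one.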

Paper Structure (21 sections, 7 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: In our approach, user preference descriptions are not required and only user-provided preference images are needed. A learnable user embedding is obtained by training on these images, and this embedding accurately captures the user’s preference information.
  • Figure 2: Premier training framework. (a) During the training of the preference adapters, the user preference embeddings and the adapters are jointly optimized. The block-shared adapter produces a uniform modulation direction across all DiT blocks, whereas the block-distinct adapter generates different modulation directions for different DiT blocks. (b) Each preference adapter takes the learnable user embedding and the input text tokens as inputs, and outputs a preference modulation direction for every text token, enabling fine-grained and context-aware modulation. (c) Our method obtains the new user’s preference embedding as a linear combination of training-set user preference embeddings. During this stage, only the linear combination coefficients are optimized. This strategy yields a more stable user preference embedding when the user’s historical data is limited.
  • Figure 3: Qualitative comparisons of Preference Alignment. We compare the performance of our method with other approaches in user preference-aware image generation. The images generated by our method are closest to the user’s preferences while remaining faithful to the user-provided text prompt.
  • Figure 4: User study results of our method compared with other methods. Each human expert is presented with six historical preference images from the user, along with image pairs generated by our method and other baselines under the same text prompt. Experts are asked to select the image that best aligns with both the user’s preferences and the input text.
  • Figure 5: Qualitative ablation comparison of our method. Ablating either of the two preference adapters leads to a significant performance drop, confirming their necessity. Ablating the text-preference modulation also degrades user-preference-aware image generation.
  • ...and 11 more figures
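
The new-user strategy described in the Figure 2(c) caption, where the training-set embeddings stay frozen and only the combination coefficients are optimized, can be sketched as follows. This is a minimal illustration under stated assumptions: the real objective would be the generative model's training loss on the new user's images, which is replaced here by a stand-in squared error to a made-up target vector. The names `combine`, `fit_coefficients`, `bank`, and `target` are hypothetical.

```python
def combine(coeffs, bank):
    """Return the linear combination sum_i coeffs[i] * bank[i]."""
    dim = len(bank[0])
    return [sum(c * e[k] for c, e in zip(coeffs, bank)) for k in range(dim)]


def fit_coefficients(bank, target, steps=500, lr=0.1):
    """Gradient descent on the coefficients only; the bank is never modified.

    Stand-in objective (an assumption): 0.5 * ||combine(coeffs, bank) - target||^2.
    """
    n = len(bank)
    coeffs = [1.0 / n] * n  # start from the average of existing users
    for _ in range(steps):
        pred = combine(coeffs, bank)
        residual = [p - t for p, t in zip(pred, target)]
        # Gradient w.r.t. coeffs[i] is the dot product <residual, bank[i]>.
        grads = [sum(r * x for r, x in zip(residual, e)) for e in bank]
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


# Toy bank of three frozen training-user embeddings in 2-D.
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.3, 0.7]  # stand-in for whatever the new user's data prefers

coeffs = fit_coefficients(bank, target)
new_user = combine(coeffs, bank)
print(coeffs, new_user)
```

Because only as many parameters as there are training users are fitted, rather than a full embedding, this kind of parameterization has fewer degrees of freedom, which is consistent with the caption's point about stability when a user's history is limited.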