Personalized Image Generation for Recommendations Beyond Catalogs
Gabriel Patron, Zhiwei Xu, Ishan Kapnadak, Felipe Maia Polo
TL;DR
REBECA personalizes diffusion-based image generation by learning a lightweight, user-conditioned diffusion prior from implicit feedback, decoupling personalization from the image generator. The method samples personalized CLIP-space image embeddings $I^e$ from $p_{\hat{\theta}}(I^e \mid U, R)$, conditioned on a user $U$ and a rating $R$, and decodes them with a frozen backbone, enabling scalable, fine-tuning-free customization across many users. An evaluation framework that includes a personalization verifier and permutation tests shows strong alignment with individual preferences on synthetic and real datasets while maintaining high image quality. This makes large-scale personalized generation practical for recommender-style applications, without the computational burden of per-user fine-tuning or LLM mediation.
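The decoupling described above can be illustrated with a toy sketch: a conditional prior samples embeddings given a user and a rating, and a single frozen decoder shared by all users maps them to images. Everything named here is an assumption made for illustration, not the authors' implementation: `MODES`, `SPREAD`, `toy_eps_model`, and `frozen_decoder` are hypothetical, the noise predictor is an analytic stand-in for a trained network (it is exact only for the assumed Gaussian target), and the decoder is a fixed random projection rather than a pretrained diffusion backbone. The sampler is standard DDPM ancestral sampling with a linear noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
N_USERS, DIM, T = 3, 8, 1000

# Hypothetical stand-in for learned preferences: one preferred embedding per
# (user, rating) pair, with rating 0 = dislike and 1 = like.
MODES = rng.normal(size=(N_USERS, 2, DIM))
SPREAD = 0.1  # assumed std of each user's embedding distribution

# Linear DDPM noise schedule.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def toy_eps_model(x_t, t, user, rating):
    """Analytic noise predictor for the target N(MODES[user, rating], SPREAD^2 I).

    A trained conditional network would take its place in a real system.
    """
    mu = MODES[user, rating]
    var_t = alpha_bars[t] * SPREAD**2 + 1.0 - alpha_bars[t]
    return np.sqrt(1.0 - alpha_bars[t]) * (x_t - np.sqrt(alpha_bars[t]) * mu) / var_t


def sample_embeddings(user, rating, n):
    """Stage 1: DDPM ancestral sampling of n embeddings given (user, rating)."""
    x = rng.normal(size=(n, DIM))
    for t in range(T - 1, -1, -1):
        eps_hat = toy_eps_model(x, t, user, rating)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added at the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
    return x


# Stage 2: a frozen decoder shared by all users. Here it is a fixed random
# projection to a 4x4 "image"; no per-user parameters are involved.
DECODER_WEIGHTS = rng.normal(size=(DIM, 16))


def frozen_decoder(embeddings):
    return np.tanh(embeddings @ DECODER_WEIGHTS).reshape(-1, 4, 4)


liked = sample_embeddings(user=0, rating=1, n=256)
images = frozen_decoder(liked)
```

In this sketch, personalizing for a new user changes only the conditioning passed to stage 1; the decoder weights are never updated.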
Abstract
Personalization is central to human-AI interaction, yet current diffusion-based image generation systems remain largely insensitive to user diversity. Existing attempts to address this often rely on costly paired preference data or introduce latency through Large Language Models. In this work, we introduce REBECA (REcommendations BEyond CAtalogs), a lightweight and scalable framework for personalized image generation that learns directly from implicit feedback signals such as likes, ratings, and clicks. Instead of fine-tuning the underlying diffusion model, REBECA uses a two-stage process: it first trains a conditional diffusion model to sample user- and rating-specific image embeddings, then decodes those embeddings into images with a pretrained diffusion backbone. This approach enables efficient, fine-tuning-free personalization across large user bases. We rigorously evaluate REBECA on real-world datasets, proposing a novel statistical personalization verifier and a permutation-based hypothesis test to assess preference alignment. Our results demonstrate that REBECA consistently produces high-fidelity images tailored to individual tastes, outperforming baselines while maintaining computational efficiency.
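To make the idea of a permutation-based test for preference alignment concrete, here is a minimal generic sketch. The abstract does not specify the test statistic, so the setup is an assumption: `scores[i, j]` is a hypothetical verifier score for images generated for user `i`, judged against user `j`'s preferences, and the statistic is the mean score over matched (user, image) pairs. Under the null hypothesis that generation is not personalized, shuffling which user each image set is matched to should not systematically lower that mean.

```python
import numpy as np


def permutation_test(scores, n_perm=2000, seed=0):
    """One-sided permutation test on a square score matrix.

    scores[i, j]: hypothetical verifier score of images generated for user i,
    evaluated against user j's preferences.
    Returns (observed matched-pair mean, p-value).
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    rows = np.arange(n)
    observed = scores[rows, rows].mean()
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)  # random reassignment of users to image sets
        if scores[rows, perm].mean() >= observed:
            exceed += 1
    # The +1 terms count the observed assignment itself, so p is never zero.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic example: matched pairs receive a higher score than mismatched ones.
demo_rng = np.random.default_rng(1)
demo_scores = demo_rng.uniform(0.0, 0.5, size=(20, 20))
demo_scores[np.arange(20), np.arange(20)] += 0.4
observed, p_value = permutation_test(demo_scores)
```

A small `p_value` here indicates that matched pairs score higher than would be expected if user assignments were exchangeable.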
