JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, Yogesh Balaji
TL;DR
JeDi is a finetuning-free approach to personalized text-to-image generation that learns a joint distribution over sets of related images sharing a common subject. It combines a scalable synthetic data pipeline (S^3) with a joint-image diffusion model that couples self-attention across the images in a set, which preserves subject identity across diverse prompts without the information loss of encoder-based approaches. Personalization is cast as inpainting within the joint-image set: the reference images are given, the remaining images are generated, and image guidance further improves fidelity. The method reports state-of-the-art results against both finetuning-based and finetuning-free baselines. It is fast and resource-efficient while maintaining high visual fidelity, though it must condition on all reference images at inference time. Database-scale personalization and multi-subject generation are left to future work.
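To make the idea of coupling self-attention across images concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `coupled_self_attention`, the single-head layout, and the projection matrices `w_q`, `w_k`, `w_v` are all hypothetical. The only point it shows is that each image's queries attend to keys and values gathered from every image in the set, so information about the shared subject can flow between images.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coupled_self_attention(tokens, w_q, w_k, w_v):
    """Single-head attention coupled across an image set (illustrative only).

    tokens: array of shape (n_images, n_tokens, dim), one token grid per image.
    w_q, w_k, w_v: hypothetical projection matrices of shape (dim, dim).
    Returns an array of shape (n_images, n_tokens, dim).
    """
    n_images, n_tokens, _ = tokens.shape
    q = tokens @ w_q  # queries stay per image: (N, T, D)
    # Keys and values are pooled over all images in the set: (1, N*T, D).
    k = (tokens @ w_k).reshape(1, n_images * n_tokens, -1)
    v = (tokens @ w_v).reshape(1, n_images * n_tokens, -1)
    # Each image's tokens score against every token of every image.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (N, T, N*T)
    weights = softmax(scores, axis=-1)
    return weights @ v  # (N, T, D)


# Usage: a set of 3 images, 4 tokens each, feature dimension 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = coupled_self_attention(tokens, w_q, w_k, w_v)
```

In an ordinary self-attention layer the keys and values would come only from the same image. Pooling them over the set is the one change shown here; a real model would add multiple heads, normalization, and an output projection.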
Abstract
Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve this personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality remains well below that of their finetuning-based counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both prior finetuning-based and finetuning-free personalization baselines.
