JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, Yogesh Balaji

TL;DR

JeDi presents a finetuning-free approach to personalized text-to-image generation by learning a joint distribution over sets of related images that share a subject. It achieves this with a scalable synthetic data pipeline (S$^3$) and a joint-image diffusion model that couples self-attention across multiple images, enabling faithful identity preservation across diverse prompts without encoder losses. Personalization is performed as inpainting within the joint-image set, using reference images and image guidance to improve fidelity, and the method demonstrates state-of-the-art results against both finetuning-based and finetuning-free baselines. The approach offers practical benefits in speed and resource efficiency while maintaining high visual fidelity, though it requires conditioning on all references at inference and points to future work on database-scale personalization and multi-subject generation.

Abstract

Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower compared to their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
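
The idea of using reference images as input during sampling can be illustrated with a toy sketch. The summary here does not spell out the mechanism, so the code below assumes a simple replacement-style inpainting loop: the reference slots of the joint set are overwritten with noised copies of the known images at every step, and only the remaining slots are actually generated. The function names, the linear noise schedule, and `toy_denoiser` are all hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def toy_denoiser(x, noise_level):
    # Hypothetical stand-in for a joint-image denoiser: each image is pulled
    # toward the mean of the whole set, mimicking coupling between images.
    return 0.5 * x + 0.5 * x.mean(axis=0, keepdims=True)

def inpaint_joint_set(denoise_step, references, num_targets, num_steps, rng):
    """Toy replacement-style inpainting over a joint image set (assumed scheme).

    references: array of shape (r, H, W) holding the known reference images.
    Returns the generated target images and the full joint set.
    """
    r = references.shape[0]
    shape = (r + num_targets,) + references.shape[1:]
    x = rng.normal(size=shape)  # every slot starts as pure noise
    for step in range(num_steps, 0, -1):
        noise_level = step / num_steps  # assumed linear schedule, from 1 toward 0
        # Keep the known slots consistent with the references at this noise level.
        noise = rng.normal(size=references.shape)
        x[:r] = (1.0 - noise_level) * references + noise_level * noise
        # Denoise the whole set jointly; only the target slots are truly unknown.
        x = denoise_step(x, noise_level)
    x[:r] = references  # the reference slots end as the clean inputs
    return x[r:], x

rng = np.random.default_rng(0)
refs = rng.normal(size=(3, 4, 4))  # three toy 4x4 "reference images"
targets, full = inpaint_joint_set(toy_denoiser, refs, num_targets=2, num_steps=10, rng=rng)
```

In this sketch the number of references is just the size of the first axis of `references`, which is one way to picture a model that accepts any number of reference images without further optimization.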

Paper Structure (18 sections, 4 equations, 19 figures, 10 tables, 1 algorithm)

Figures (19)

  • Figure 1: We present Joint-Image Diffusion (JeDi), a finetuning-free image personalization model that can operate on any number of reference images. JeDi is able to preserve the appearance of custom subjects while generating novel variations. As shown in the top row, JeDi does not suffer from the issues of overfitting and lack of diversity exhibited by the prior models. The examples in the bottom two rows demonstrate JeDi's high-quality results on challenging personalization tasks.
  • Figure 2: Overall framework. (a) We generate training data by using large language models and prompting pretrained single-image diffusion models. (b) During training, the JeDi model learns to denoise multiple same-subject images together, where each image attends to every image of the same subject set through coupled self-attention. (c) At inference, personalized generation is performed in an inpainting fashion where the goal is to generate the missing images of the joint-image set.
  • Figure 3: Data generation process. We construct the Synthetic Same-Subject (S$^3$) dataset by first prompting pretrained text-to-image diffusion models to generate same-subject photo collages, and then increasing the diversity using text-based background inpainting.
  • Figure 4: Samples from the synthetic same-subject (S$^3$) dataset. Each column denotes different images from one joint-data sample.
  • Figure 5: Visualization of the coupled self-attentions. For both scales (8x8 and 16x16), the correspondence map (Corr.) shows the connections with the highest weights between elements in the two images. The heatmap visualizes the distribution of the attention weights in an image for a specific element in another image (marked with a red box). We observe that similar regions in different images are co-attended in the coupled self-attention layers.
  • ...and 14 more figures
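
The coupled self-attention mentioned in the captions of Figures 2(b) and 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head, and the names `coupled_self_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. The only point it shows is that tokens from all images of a same-subject set are concatenated before attention, so each image attends to every image in the set.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coupled_self_attention(x, w_q, w_k, w_v):
    """Single-head attention over the tokens of a whole image set.

    x: array of shape (num_images, tokens, dim), one same-subject set.
    Returns per-image outputs and the joint attention weight matrix.
    """
    n, t, d = x.shape
    # Flatten the set so keys and values come from every image, not just one.
    joint = x.reshape(n * t, d)
    q, k, v = joint @ w_q, joint @ w_k, joint @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = weights @ v
    return out.reshape(n, t, -1), weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 8))  # 3 images, 4 tokens each, 8 channels
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = coupled_self_attention(x, w_q, w_k, w_v)

# Block of weights from tokens of image 0 to tokens of image 1; taking the
# argmax per row gives a crude cross-image correspondence for each token.
block_0_to_1 = weights[0:4, 4:8]
correspondence = block_0_to_1.argmax(axis=-1)
```

Slicing the joint weight matrix into image-to-image blocks, as in the last lines, is one plausible way to obtain correspondence maps of the kind Figure 5 describes; how the figure was actually produced is not stated here.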