Generating Multi-Image Synthetic Data for Text-to-Image Customization

Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi

TL;DR

This work tackles the scarcity of multi-image supervision for personalized text-to-image customization by introducing SynCD, a synthetic dataset generated from 3D assets and LLM-guided prompts to yield multiple views of the same object. An encoder-based customization model is trained with a Shared Attention mechanism to condition on multiple reference images, and a normalization-based inference strategy mitigates overexposure while following text prompts. Experiments show that the SynCD-trained model outperforms leading encoder-based customization methods and remains competitive with optimization-based approaches on standard benchmarks, achieving strong object identity preservation and text alignment. The approach enables scalable, tuning-free personalization of text-to-image models and paves the way for broader, data-efficient customization at scale.

Abstract

Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
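
The guidance-normalization step mentioned above can be illustrated with a short sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the decomposition into an image-guidance vector and a text-guidance vector, the choice to rescale both to the smaller of their two norms, the default scales, and the function names `rescale_to_common_norm` and `normalized_guidance`. It is not the authors' implementation.

```python
import numpy as np

def rescale_to_common_norm(g_img, g_txt, tiny=1e-8):
    """Rescale both guidance vectors to the smaller of their two L2 norms.

    Assumption: using the minimum norm as the common target is an illustrative
    choice, not the rule stated in the paper.
    """
    n_img = np.linalg.norm(g_img)
    n_txt = np.linalg.norm(g_txt)
    target = min(n_img, n_txt)
    return g_img * (target / (n_img + tiny)), g_txt * (target / (n_txt + tiny))

def normalized_guidance(eps_uncond, eps_img, eps_img_txt,
                        scale_img=3.0, scale_txt=7.5):
    """Combine three noise predictions using normalized guidance vectors.

    eps_uncond:  prediction with no conditioning
    eps_img:     prediction conditioned on reference images only
    eps_img_txt: prediction conditioned on reference images and text
    The scale defaults are hypothetical values.
    """
    g_img = eps_img - eps_uncond      # image guidance direction
    g_txt = eps_img_txt - eps_img     # text guidance direction
    g_img, g_txt = rescale_to_common_norm(g_img, g_txt)
    return eps_uncond + scale_img * g_img + scale_txt * g_txt
```

Bounding both vectors by a common norm keeps either term from dominating the update, which is one plausible way such a normalization could limit overexposure in the sampled images.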

Paper Structure

This paper contains 22 sections, 4 equations, 24 figures, 9 tables.

Figures (24)

  • Figure 1: (a) We propose a new pipeline for generating synthetic training data consisting of multiple images of the same object under different lighting, poses, and backgrounds. Given the dataset, we train a new encoder-based customization model, which can take either (b) three or (c) one reference image of the object as input and successfully generate it in new compositions using text prompts.
  • Figure 2: Dataset Generation Pipeline. Top: For deformable categories like cats, we use an object description combined with a set of background prompts, both suggested by an LLM, as input to generate multiple images of the same object in different contexts. Bottom: For rigid objects, we use a depth-conditioned text-to-image model [zhang2023adding]. It takes depth maps of Objaverse 3D assets [deitke2023objaverse] rendered from multiple views, their descriptions [luo2024scalable], and background context suggested by an LLM as input to generate the same object in varied poses and settings. We use Masked Shared Attention (MSA) and warping (in the case of rigid objects) to promote object consistency, as shown in Figure 3.
  • Figure 3: Feature warping and Masked Shared Attention (MSA) for object consistency. For rigid objects, we first warp corresponding features from the first image to the others. Then, each image feature attends to itself and to the foreground object features in the other images. We show an example mask, $\mathbf{M}_1$, used to ensure this for the first image when generating two images with the same object.
  • Figure 4: Training Method. We condition the model on reference images, $\{{\mathbf{x}}_i\}_{i=1}^K$, using a Shared Attention mechanism, similar to Figure 3. We extract fine-grained features of the reference images using the same model and have the target image features attend to the reference image features as well in the attention blocks.
  • Figure 5: Results. We compare our method qualitatively against other leading encoder-based baselines with a single reference image as input. Our method successfully incorporates the text prompt while preserving object identity as well as or better than the baseline methods. We pick the best out of $4$ images for all methods. More qualitative samples are shown in the Appendix (figure label fig:results_comparison_1ref_1).
  • ...and 19 more figures
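
The masking rule described in the Figure 3 caption (each image attends to all of its own features, and only to foreground object features in the other images) can be sketched in NumPy. This is a minimal illustration under stated assumptions: single-head attention, flattened tokens, foreground masks given as booleans, and the hypothetical helper names `msa_mask` and `masked_shared_attention`. It omits the feature warping step and is not the authors' implementation.

```python
import numpy as np

def msa_mask(fg_masks, i):
    """Boolean attention mask for the queries of image i.

    fg_masks: (N, T) bool array, True where a token lies on the foreground object.
    Returns a (T, N*T) array: True means the key may be attended to.
    """
    n, t = fg_masks.shape
    allowed = fg_masks.copy()
    allowed[i] = True  # an image attends to all of its own tokens
    return np.broadcast_to(allowed.reshape(1, n * t), (t, n * t)).copy()

def masked_shared_attention(q, k, v, fg_masks):
    """Single-head attention over the keys and values of all N images.

    q, k, v: (N, T, D) float arrays. Returns an (N, T, D) array.
    """
    n, t, d = q.shape
    k_all = k.reshape(n * t, d)
    v_all = v.reshape(n * t, d)
    out = np.empty_like(q)
    for i in range(n):
        logits = q[i] @ k_all.T / np.sqrt(d)
        logits = np.where(msa_mask(fg_masks, i), logits, -np.inf)
        # Every row keeps its own image's tokens, so the row maximum is finite.
        logits = logits - logits.max(axis=-1, keepdims=True)
        weights = np.exp(logits)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[i] = weights @ v_all
    return out
```

With two images, `msa_mask(fg_masks, 0)` plays the role of the example mask $\mathbf{M}_1$ in the caption: all tokens of the first image are visible, and only the foreground tokens of the second.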