Generating Multi-Image Synthetic Data for Text-to-Image Customization
Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi
TL;DR
This work tackles the scarcity of multi-image supervision for text-to-image customization by introducing SynCD, a synthetic dataset built from 3D assets and LLM-guided prompts that provides multiple views of the same object. An encoder-based customization model is trained on it with a shared attention mechanism to condition on multiple reference images, and an inference strategy that normalizes the guidance vectors mitigates overexposure while still following the text prompt. Experiments show that the SynCD-trained model outperforms leading encoder-based customization methods and remains competitive with optimization-based approaches on standard benchmarks, with strong object identity preservation and text alignment. The approach enables tuning-free personalization of text-to-image models, with no per-concept optimization at test time.
Abstract
Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD and sampled with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
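To make the idea of normalizing text and image guidance vectors concrete, here is a minimal sketch in plain Python. It assumes a classifier-free-guidance setup with three noise predictions (unconditional, image-conditioned, and image-plus-text-conditioned). The abstract does not give the exact rule, so the one used here (rescaling both guidance vectors to the smaller of their two norms) is an assumption made for illustration and not the paper's formula. All function and parameter names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale(v, target_norm, eps=1e-8):
    """Rescale v to (approximately) target_norm; eps guards against division by zero."""
    n = l2_norm(v)
    return [x * target_norm / (n + eps) for x in v]


def normalized_guidance(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """Combine image and text guidance after normalizing their magnitudes.

    eps_uncond: prediction with no conditioning
    eps_img:    prediction conditioned on the reference images only
    eps_full:   prediction conditioned on reference images and text
    s_img, s_txt: guidance scales (hypothetical parameter names)
    """
    # Guidance vectors: the direction each condition pushes the prediction.
    g_img = [a - b for a, b in zip(eps_img, eps_uncond)]
    g_txt = [a - b for a, b in zip(eps_full, eps_img)]

    # ASSUMPTION: bring both vectors to the smaller of the two norms so that
    # neither term dominates the update. The paper's actual rule may differ.
    target = min(l2_norm(g_img), l2_norm(g_txt))
    g_img = rescale(g_img, target)
    g_txt = rescale(g_txt, target)

    return [u + s_img * gi + s_txt * gt
            for u, gi, gt in zip(eps_uncond, g_img, g_txt)]


# Toy example: the image guidance is five times larger than the text guidance.
out = normalized_guidance([0.0, 0.0], [3.0, 4.0], [3.0, 5.0], s_img=1.0, s_txt=1.0)
```

In this toy example the image guidance vector `[3, 4]` is shrunk to unit length before the two terms are summed, so a large image guidance magnitude no longer inflates the combined update. That is one plausible way a normalization step could limit overexposure.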
