Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation
Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, Nanxuan Zhao, Jiuxiang Gu, Zichao Wang, Xin Eric Wang, Tong Sun
TL;DR
The paper tackles the high cost of constructing large-scale datasets for subject-driven text-to-image generation by introducing Toffee, a cost-efficient pipeline that pre-trains two diffusion models (a Refiner and a View Generator) and can then generate millions of high-quality image-text pairs without any subject-level fine-tuning. Using pre-trained diffusion models, DINO embeddings, Grounded-SAM, and CLIP/DINO filtering, the authors construct Toffee-5M, a dataset of 4.8M image pairs (including 1.6M editing pairs) with masks, at a dramatically reduced cost in GPU hours. A unified model, ToffeeNet, trained on this dataset performs both subject-driven editing and generation in a tuning-free, zero-shot manner and achieves competitive results on DreamBench. The work shows that scalable synthetic data is practical for subject-driven generation and editing, making such datasets more accessible and enabling faster experimentation in the community.
Abstract
In subject-driven text-to-image generation, recent works have achieved superior performance by training models on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images of a specific subject from an arbitrary test image in a zero-shot manner. They even outperform methods that require additional fine-tuning on test images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images of the same subject from creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method for constructing datasets for subject-driven editing and generation. Specifically, our dataset construction does not need any subject-level fine-tuning. After pre-training two generative models, we are able to generate an unlimited number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is five times the size of the previous largest dataset, yet its construction cost is tens of thousands of GPU hours lower. To test the proposed dataset, we also propose a model that is capable of both subject-driven image editing and generation. Simply by training on our proposed dataset, the model obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.
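The TL;DR mentions CLIP/DINO filtering of generated pairs. As a rough illustration only, the sketch below shows one way such a similarity filter could work, assuming embeddings have already been computed elsewhere. The function names, the argument layout, and both thresholds are hypothetical and are not taken from the paper; the actual filtering criteria may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def keep_pair(subject_emb, generated_emb, image_emb, text_emb,
              min_subject_sim=0.5, min_text_sim=0.2):
    """Decide whether a generated pair passes a two-part similarity filter.

    subject_emb / generated_emb: image embeddings (e.g. DINO-style) of the
        input subject image and the generated image; used for subject fidelity.
    image_emb / text_emb: joint-space embeddings (e.g. CLIP-style) of the
        generated image and its prompt; used for text alignment.
    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    subject_ok = cosine_similarity(subject_emb, generated_emb) >= min_subject_sim
    text_ok = cosine_similarity(image_emb, text_emb) >= min_text_sim
    return subject_ok and text_ok
```

A pair is kept only when the generated image is close to the subject and also matches its prompt, so a sample that drifts away from the subject is dropped even if it follows the text well.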
