Lafite2: Few-shot Text-to-Image Generation
Yufan Zhou, Chunyuan Li, Changyou Chen, Jianfeng Gao, Jinhui Xu
TL;DR
Lafite2 presents a data-efficient, language-free pre-training framework for text-to-image generation that leverages image-only data through a retrieval-augmented synthesis of pseudo text features and a contrastive latent optimization to align these features with images in CLIP space. The method is instantiated for both GANs (Lafite2_GAN) and latent diffusion models (Lafite2_LDM), achieving state-of-the-art GAN performance on MS-COCO under full supervision and competitive zero-shot/few-shot results for diffusion with significantly smaller models. Extensive experiments demonstrate strong in-domain, near-domain, and multi-domain transfer, highlighting robust few-shot and semi-supervised capabilities while maintaining efficiency. Overall, Lafite2 reduces the annotation burden and enables broad applicability across model families and datasets, offering practical benefits for researchers with limited access to web-scale image-text corpora.
Abstract
Text-to-image generation models have progressed considerably in recent years and can now generate impressively realistic images from arbitrary text. Most such models are trained on web-scale image-text paired datasets, which may not be affordable for many researchers. In this paper, we propose a novel method for pre-training text-to-image generation models on image-only datasets. It uses a retrieval-then-optimization procedure to synthesize pseudo text features: for a given image, relevant pseudo text features are first retrieved, then optimized for better alignment. The low requirements of the proposed method yield high flexibility and usability: it benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning, and it can be applied to different models, including generative adversarial networks (GANs) and diffusion models. Extensive experiments illustrate the effectiveness of the proposed method. On the MS-COCO dataset, our GAN model obtains a Fréchet Inception Distance (FID) of 6.78, which is the new state of the art (SoTA) for GANs under the fully-supervised setting. Our diffusion model obtains FIDs of 8.42 and 4.28 in the zero-shot and supervised settings, respectively, which are competitive with SoTA diffusion models despite a much smaller model size.
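The retrieval-then-optimization idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes image and text features are plain vectors in a shared embedding space (such as CLIP's), uses a hypothetical in-memory `text_bank` of candidate text features, and replaces the paper's optimization objective with simple gradient ascent on cosine similarity. The function names `retrieve` and `optimize` and all hyperparameters are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def retrieve(image_feat, text_bank, k=3):
    """Return the k unit-normalized bank features most cosine-similar to the image feature."""
    bank = l2_normalize(text_bank)
    sims = bank @ l2_normalize(image_feat)      # cosine similarity per bank entry
    top = np.argsort(-sims)[:k]                 # indices of the k best matches
    return bank[top], top


def optimize(pseudo_text, image_feat, steps=50, lr=0.1):
    """Nudge each retrieved feature toward the image feature.

    Assumption: plain gradient ascent on cosine similarity stands in
    for the alignment objective described in the paper.
    """
    img = l2_normalize(image_feat)
    h = pseudo_text.copy()
    for _ in range(steps):
        norm = np.linalg.norm(h, axis=-1, keepdims=True)
        u = h / norm
        cos = u @ img                                    # shape (k,)
        grad = (img[None, :] - cos[:, None] * u) / norm  # d cos / d h
        h = h + lr * grad
    return l2_normalize(h)


# Toy data standing in for real encoder outputs (assumption).
rng = np.random.default_rng(0)
text_bank = rng.normal(size=(100, 16))
image_feat = rng.normal(size=16)

retrieved, idx = retrieve(image_feat, text_bank, k=3)
aligned = optimize(retrieved, image_feat)

before = retrieved @ l2_normalize(image_feat)
after = aligned @ l2_normalize(image_feat)
print("cosine before:", before)
print("cosine after: ", after)
```

In this toy setting the optimized features end up closer to the image feature than the retrieved ones. A real pipeline would obtain both feature sets from a pretrained encoder and use the objective specified in the paper.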
