Lafite2: Few-shot Text-to-Image Generation

Yufan Zhou; Chunyuan Li; Changyou Chen; Jianfeng Gao; Jinhui Xu

TL;DR

Lafite2 presents a data-efficient, language-free pre-training framework for text-to-image generation that leverages image-only data through a retrieval-augmented synthesis of pseudo text features and a contrastive latent optimization to align these features with images in CLIP space. The method is instantiated for both GANs (Lafite2_GAN) and latent diffusion models (Lafite2_LDM), achieving state-of-the-art GAN performance on MS-COCO under full supervision and competitive zero-shot/few-shot results for diffusion with significantly smaller models. Extensive experiments demonstrate strong in-domain, near-domain, and multi-domain transfer, highlighting robust few-shot and semi-supervised capabilities while maintaining efficiency. Overall, Lafite2 reduces the annotation burden and enables broad applicability across model families and datasets, offering practical benefits for researchers with limited access to web-scale image-text corpora.

Abstract

Text-to-image generation models have progressed considerably in recent years and can now generate impressively realistic images from arbitrary text. Most such models are trained on web-scale image-text paired datasets, which may not be affordable for many researchers. In this paper, we propose a novel method for pre-training text-to-image generation models on image-only datasets. It uses a retrieval-then-optimization procedure to synthesize pseudo text features: for a given image, relevant pseudo text features are first retrieved, then optimized for better alignment. The modest requirements of the proposed method yield high flexibility and usability: it benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning, and it can be applied to different models, including generative adversarial networks (GANs) and diffusion models. Extensive experiments illustrate the effectiveness of the proposed method. On the MS-COCO dataset, our GAN model obtains a Fréchet Inception Distance (FID) of 6.78, a new state of the art (SoTA) for GANs under the fully-supervised setting. Our diffusion model obtains FIDs of 8.42 and 4.28 in the zero-shot and supervised settings, respectively, which are competitive with SoTA diffusion models despite a much smaller model size.
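
The retrieval-then-optimization idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature bank `text_bank`, the helpers `retrieve_pseudo_text` and `refine`, the top-k averaging, and the random stand-in features are all hypothetical, and the refinement step uses plain gradient ascent on cosine similarity as a simple stand-in for the paper's contrastive latent optimization.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_pseudo_text(image_feat, text_bank, k=3):
    """Hypothetical retrieval step: average the k bank features closest to the image."""
    sims = text_bank @ image_feat            # cosine similarity to every bank entry
    top = np.argsort(-sims)[:k]              # indices of the k most similar entries
    return l2_normalize(text_bank[top].mean(axis=0))

def refine(pseudo, image_feat, steps=20, lr=0.1):
    """Stand-in refinement: gradient ascent on cosine similarity with the image.

    The paper uses a contrastive objective; this simplified loss is an assumption.
    """
    h = pseudo.copy()
    for _ in range(steps):
        # Gradient of cos(h, v) on the unit sphere: component of v orthogonal to h.
        grad = image_feat - (h @ image_feat) * h
        h = l2_normalize(h + lr * grad)
    return h

# Random unit vectors stand in for CLIP features (assumption for illustration).
rng = np.random.default_rng(0)
text_bank = l2_normalize(rng.normal(size=(100, 16)))
image_feat = l2_normalize(rng.normal(size=16))

pseudo = retrieve_pseudo_text(image_feat, text_bank)
refined = refine(pseudo, image_feat)
print("similarity before:", float(pseudo @ image_feat))
print("similarity after: ", float(refined @ image_feat))
```

In this toy setting the refined feature is never less aligned with the image feature than the retrieved one, which mirrors the "retrieve, then optimize for better alignment" description above.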

Paper Structure (24 sections, 1 theorem, 9 equations, 8 figures, 8 tables)

Introduction
Preliminaries: Probing Multimodal Feature Space
Proposed Method: Retrieval-then-Optimization
Pseudo Text-Feature Synthesis: A Retrieval-Augmented Approach
Pseudo Text-Feature Refinement: Contrastive Latent Optimization
Theoretical Justification.
Lafite2 Model Instantiation
Lafite2$_{\textbf{GAN}}$.
Lafite2$_{\textbf{LDM}}$.
Experiments
Settings & Evaluation Metrics.
Unsupervised Pre-training
Zero-shot and Few-shot Task Transfer
Zero-shot.
Few-shot.
...and 9 more sections

Key Result

Theorem 1

Let $\{\mathbf{x}_i^\prime\}_{i=1}^n$ be a mini-batch of generated images, $\{\mathbf{h}_i\}_{i=1}^n$ be the corresponding text features fed into the generator $G_{\bm{\theta}}$. For the contrastive loss $\mathcal{L}$ in eq:contrastive_loss, we

Figures (8)

Figure 1: Distributions of cosine similarities for (a) image-text pairs (including pseudo and ground-truth text); (b) text-text pairs from the same image; (c) text-text pairs from randomly sampled images. (d) Illustration of the multi-modal feature space of CLIP, where the blue solid arrow denotes an image feature, the red dotted arrow denotes the text feature of the corresponding ground-truth caption, and the black dashed arrows denote possible pseudo text features generated with Lafite.
Figure 2: Illustration of the proposed method, which first generates pseudo text features and then optimizes them within the multi-modal feature space of CLIP. In general, we expect the retrieval step to generate pseudo text features that align with images, and the optimization step to make the pseudo text features contain more discriminative semantic information (indicated in parentheses).
Figure 3: Generated examples with captions from MS-COCO validation set. With the proposed method, our LDM leads to better zero-shot generation. The generated images have better quality and desired styles.
Figure 4: Fine-tuning after unsupervised pre-training (blue background) leads to better performance than fully-supervised training from scratch.
Figure 5: Generated examples on MM-CelebA-HQ dataset.
...and 3 more figures

Theorems & Definitions (1)

Theorem 1
