Text-to-Image GAN with Pretrained Representations

Xiaozhou You; Jian Zhang

TL;DR

This work proposes TIGER, a text-to-image GAN with pretrained representations. It combines a vision-empowered discriminator with a high-capacity generator that aims to achieve effective text-image fusion while increasing model capacity.

Abstract

Generating images conditioned on text descriptions has received considerable attention. Recently, diffusion models and autoregressive models have demonstrated outstanding expressivity and have gradually replaced GANs as the favored architectures for text-to-image synthesis. However, they still face two obstacles: slow inference and expensive training. To achieve more powerful and faster text-to-image synthesis under complex scenes, we propose TIGER, a text-to-image GAN with pretrained representations. Specifically, we propose a vision-empowered discriminator and a high-capacity generator. (i) The vision-empowered discriminator absorbs the complex scene understanding ability and the domain generalization ability of pretrained vision models to enhance model performance. Unlike previous works, we explore stacking multiple pretrained models in our discriminator to collect multiple different representations. (ii) The high-capacity generator aims to achieve effective text-image fusion while increasing the model capacity. It consists of multiple novel high-capacity fusion blocks (HFBlocks). Each HFBlock contains several deep fusion modules and a global fusion module, which play different roles to benefit our model. Extensive experiments demonstrate the outstanding performance of TIGER on both standard and zero-shot text-to-image synthesis tasks. On the standard text-to-image synthesis task, TIGER achieves state-of-the-art performance on two challenging datasets, obtaining new FIDs of 5.48 (COCO) and 9.38 (CUB). On the zero-shot text-to-image synthesis task, we achieve comparable performance with fewer model parameters, smaller training data size and faster inference speed. Additional experiments and analyses are provided in the Supplementary Material.
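
To make the idea of stacking multiple pretrained models in the discriminator more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each sub-discriminator $D_{i}$ is a frozen feature extractor $H_{i}$ followed by a small trainable head, and that the per-model losses are summed. The fixed random projections standing in for pretrained backbones, the linear heads, the hinge loss, and all function names are hypothetical choices made for illustration; text conditioning is omitted for brevity.

```python
import numpy as np

def make_frozen_backbone(in_dim, feat_dim, seed):
    """Stand-in for a frozen pretrained vision model H_i.

    Assumption: a fixed random projection replaces a real pretrained network.
    """
    w = np.random.default_rng(seed).standard_normal((in_dim, feat_dim)) / np.sqrt(in_dim)
    return lambda images: np.tanh(images @ w)

def sub_discriminator(images, backbone, head_w):
    """D_i: frozen features from H_i followed by a trainable linear head."""
    return backbone(images) @ head_w  # (batch,) real/fake logits

def combined_logits(images, backbones, heads):
    """Stack the logits of all sub-discriminators: shape (num_models, batch)."""
    return np.stack([sub_discriminator(images, b, h) for b, h in zip(backbones, heads)])

def hinge_d_loss(real_logits, fake_logits):
    """Hinge discriminator loss summed over sub-discriminators (assumed choice)."""
    per_model = (np.maximum(0.0, 1.0 - real_logits).mean(axis=1)
                 + np.maximum(0.0, 1.0 + fake_logits).mean(axis=1))
    return float(per_model.sum())

rng = np.random.default_rng(0)
in_dim, batch = 48, 5                      # flattened toy "images"
feat_dims = [32, 64]                       # two hypothetical backbones
backbones = [make_frozen_backbone(in_dim, d, seed=i) for i, d in enumerate(feat_dims)]
heads = [rng.standard_normal(d) * 0.1 for d in feat_dims]

real = rng.standard_normal((batch, in_dim))
fake = rng.standard_normal((batch, in_dim))
real_logits = combined_logits(real, backbones, heads)
fake_logits = combined_logits(fake, backbones, heads)
loss = hinge_d_loss(real_logits, fake_logits)
```

In this sketch only the heads would receive gradient updates; the backbones stay fixed, which is one plausible way to reuse pretrained representations without retraining them.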

Paper Structure (18 sections, 4 equations, 6 figures, 3 tables)

Introduction
Related Works
Text-to-Image Synthesis.
Large-Scale Text-to-Image Models.
Methods
Model Overview
High-Capacity Fusion Block
Deep Fusion Module.
Global Fusion Module.
Vision-Empowered Discriminator
Loss Function
Semantic Contrastive Loss.
Overall Loss.
Experiments
Quantitative Evaluation
...and 3 more sections

Figures (6)

Figure 1: (a): Existing text-to-image GANs are trained from scratch. (b): To build a more powerful and faster GAN, our proposed TIGER consists of a vision-empowered discriminator and a high-capacity generator. The vision-empowered discriminator consists of several sub-discriminators $D_{i}$, each of which contains a model $H_{i}$ selected from the pretrained vision model bank $\mathcal{H}$ to enhance the complex scene understanding ability and domain generalization ability. The high-capacity generator achieves effective cross-modal text-image fusion while increasing model capacity.
Figure 2: The performance of our proposed TIGER. (a): On the standard text-to-image synthesis task, TIGER achieves new state-of-the-art FIDs of 5.48 (COCO) and 9.38 (CUB). Lower FID is better. (b): On the zero-shot text-to-image synthesis task, TIGER achieves comparable performance (zero-shot FID) with fewer model parameters, smaller training data size and 120$\times$ faster inference speed than LDM (DF) and Parti-350M (AR).
Figure 3: The architectures we investigate. (a): The high-capacity generator consists of several high-capacity fusion blocks to generate desired images under complex scenes. (b): The high-capacity fusion block includes several deep fusion modules and a global fusion module to further improve model capacity; "D-Conv" stands for dilated convolutional network. (c): The affine module achieves effective cross-modal text-image fusion.
Figure 4: (a): The architecture of a sub-discriminator in our vision-empowered discriminator. Each sub-discriminator processes the representation from a different pretrained vision network. (b) and (c): Two alternative adapter architectures. For multi-level features, we try two different adapters to enhance model performance.
Figure 5: Qualitative comparison between AttnGAN, DF-GAN, and our proposed TIGER conditioned on text descriptions from the test sets of the COCO dataset (1st - 4th columns) and the CUB dataset (5th - 8th columns).
...and 1 more figures
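
The caption of Figure 3(c) states that an affine module performs the text-image fusion but does not spell out its form here. A common reading, assumed in the sketch below, is a channel-wise scale and shift of the image features predicted from the sentence embedding. The function name, the single linear projections `w_gamma` and `w_beta`, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def affine_fusion(feat, text_emb, w_gamma, w_beta):
    """Channel-wise affine modulation of image features by a text embedding.

    feat:     (C, H, W) image feature map
    text_emb: (T,) sentence embedding
    w_gamma, w_beta: (C, T) hypothetical projections (learned in practice)
    """
    gamma = w_gamma @ text_emb  # (C,) per-channel scale predicted from text
    beta = w_beta @ text_emb    # (C,) per-channel shift predicted from text
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, None, None] * feat + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, T = 4, 8, 8, 16
feat = rng.standard_normal((C, H, W))
text_emb = rng.standard_normal(T)
w_gamma = rng.standard_normal((C, T))
w_beta = rng.standard_normal((C, T))
out = affine_fusion(feat, text_emb, w_gamma, w_beta)
```

Because the scale and shift depend only on the text, the same sentence modulates every spatial location of a channel identically, while different sentences produce different modulations.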
