Text-to-Image GAN with Pretrained Representations

Xiaozhou You, Jian Zhang

TL;DR

This work proposes TIGER, a text-to-image GAN with pretrained representations. It combines a vision-empowered discriminator with a high-capacity generator that aims to achieve effective text-image fusion while increasing model capacity.

Abstract

Generating desired images conditioned on given text descriptions has received considerable attention. Recently, diffusion models and autoregressive models have demonstrated outstanding expressivity and have gradually replaced GANs as the favored architectures for text-to-image synthesis. However, they still face two obstacles: slow inference and expensive training. To achieve more powerful and faster text-to-image synthesis under complex scenes, we propose TIGER, a text-to-image GAN with pretrained representations. Specifically, we propose a vision-empowered discriminator and a high-capacity generator. (i) The vision-empowered discriminator absorbs the complex scene understanding ability and the domain generalization ability of pretrained vision models to enhance model performance. Unlike previous works, we explore stacking multiple pretrained models in our discriminator to collect multiple different representations. (ii) The high-capacity generator aims to achieve effective text-image fusion while increasing model capacity. It consists of multiple novel high-capacity fusion blocks (HFBlocks). Each HFBlock contains several deep fusion modules and a global fusion module, which play different roles to benefit our model. Extensive experiments demonstrate the outstanding performance of TIGER on both standard and zero-shot text-to-image synthesis tasks. On the standard text-to-image synthesis task, TIGER achieves state-of-the-art performance on two challenging datasets, with new best FIDs of 5.48 (COCO) and 9.38 (CUB). On the zero-shot text-to-image synthesis task, we achieve comparable performance with fewer model parameters, a smaller training set and faster inference. Additional experiments and analyses are provided in the Supplementary Material.
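
As a rough illustration of point (i), stacking several pretrained models in the discriminator can be pictured as a set of sub-discriminators, each scoring the features of one frozen extractor, whose scores are then combined. The abstract does not say how the scores are combined or how the text condition enters, so the sketch below is only a toy under stated assumptions: the class `SubDiscriminator`, the two stand-in extractors, the linear heads and the mean aggregation are all hypothetical, and text conditioning is omitted.

```python
from typing import Callable, List, Sequence

Image = Sequence[float]     # toy stand-in for an image tensor
Features = Sequence[float]  # toy stand-in for a pretrained representation


class SubDiscriminator:
    """One frozen extractor plus a small trainable head (both hypothetical)."""

    def __init__(self, extractor: Callable[[Image], Features],
                 weights: Sequence[float]) -> None:
        self.extractor = extractor    # frozen: stands in for a pretrained vision model
        self.weights = list(weights)  # trainable head, here a plain linear layer

    def score(self, image: Image) -> float:
        feats = self.extractor(image)
        return sum(w * f for w, f in zip(self.weights, feats))


def discriminator_logit(subs: List[SubDiscriminator], image: Image) -> float:
    """Combine sub-discriminator scores; averaging is an assumed choice."""
    return sum(d.score(image) for d in subs) / len(subs)


def mean_feature(image: Image) -> Features:
    """Toy 'pretrained' extractor number one."""
    return [sum(image) / len(image)]


def range_feature(image: Image) -> Features:
    """Toy 'pretrained' extractor number two, giving a different representation."""
    return [max(image) - min(image)]


subs = [SubDiscriminator(mean_feature, [1.0]),
        SubDiscriminator(range_feature, [0.5])]
logit = discriminator_logit(subs, [0.0, 0.5, 1.0])
```

In a real model each extractor would be a frozen pretrained network and each head a small trainable network; the point of the sketch is only that the discriminator sees several different representations of the same image.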

Paper Structure (18 sections, 4 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: (a): Existing text-to-image GANs are trained from scratch. (b): To build a more powerful and faster GAN, our proposed TIGER consists of a vision-empowered discriminator and a high-capacity generator. The vision-empowered discriminator consists of several sub-discriminators $D_{i}$, each of which contains a model $H_{i}$ selected from the pretrained vision model bank $\mathcal{H}$ to enhance the complex scene understanding ability and domain generalization ability. The high-capacity generator achieves effective cross-modal text-image fusion while increasing model capacity.
  • Figure 2: The outstanding performance of our proposed TIGER. (a): On the standard text-to-image synthesis task, TIGER achieves new state-of-the-art FIDs of 5.48 (COCO) and 9.38 (CUB). Lower FID is better. (b): On the zero-shot text-to-image synthesis task, TIGER achieves comparable performance (zero-shot FID) with fewer model parameters, a smaller training set and 120$\times$ faster inference than LDM (DF) and Parti-350M (AR).
  • Figure 3: The architectures we investigate. (a): The high-capacity generator consists of several high-capacity fusion blocks to generate desired images under complex scenes. (b): The high-capacity fusion block includes several deep fusion modules and a global fusion module to further improve model capacity; "D-Conv" stands for dilated convolutional network. (c): The affine module achieves effective cross-modal text-image fusion.
  • Figure 4: (a): The architecture of a sub-discriminator in our vision-empowered discriminator. Each sub-discriminator processes the representation from a different pretrained vision network. (b) and (c): Two alternative adapter architectures. For multi-level features, we try two different adapters to enhance model performance.
  • Figure 5: Qualitative comparison between AttnGAN, DF-GAN, and our proposed TIGER conditioned on text descriptions from the test sets of the COCO dataset (1st - 4th columns) and the CUB dataset (5th - 8th columns).
  • ...and 1 more figures
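
The Figure 3 caption mentions an affine module for cross-modal text-image fusion but does not give its form. One common formulation, assumed here and not confirmed by the text above, predicts a per-channel scale and shift from the sentence embedding and applies them to the image feature map. The function names, the single linear layers and the toy numbers below are all illustrative.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Plain dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def affine_fuse(h: FeatureMap, text: Vector,
                w_gamma: List[Vector], b_gamma: Vector,
                w_beta: List[Vector], b_beta: Vector) -> FeatureMap:
    """Text-conditioned channel-wise affine transform (assumed form)."""
    gamma = linear(w_gamma, b_gamma, text)  # per-channel scale from the text
    beta = linear(w_beta, b_beta, text)     # per-channel shift from the text
    return [[[g * v + b for v in row] for row in channel]
            for channel, g, b in zip(h, gamma, beta)]


# Toy input: 2 channels, each a 1x2 feature map, and a 2-d text embedding.
h = [[[1.0, 2.0]], [[3.0, 4.0]]]
text = [1.0, -1.0]
fused = affine_fuse(
    h, text,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],  # gamma = [2.0, 0.0]
    w_beta=[[0.5, 0.5], [1.0, 0.0]], b_beta=[0.0, 0.0],    # beta  = [0.0, 1.0]
)
```

Because the scale and shift depend only on the text, the same sentence modulates every spatial location of a channel identically; here the second channel is scaled to zero and replaced by its text-derived shift.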