Text-To-Image with Generative Adversarial Networks

Mehrshad Momen-Tayefeh

TL;DR

This work tackles the problem of generating realistic images from natural language descriptions by surveying five GAN-based text-to-image methods (e.g., GAN-CLS, StackGAN, AttnGAN, and SDN). It analyzes architectural differences, datasets, evaluation metrics, and output resolutions to compare performance across standard datasets like CUB-200-2011, Oxford-102, and MSCOCO. The study finds that AttnGAN often achieves the best Inception Score on MSCOCO, while SDN excels on simpler datasets, highlighting the value of attention mechanisms and multi-stage generation for fidelity. The results inform model selection and evaluation practices in text-to-image synthesis, emphasizing the practical impact of architectural choices on image realism and caption alignment.

Abstract

Generating realistic images from human-written text is one of the most challenging problems in computer vision (CV). Existing text-to-image approaches can roughly reflect the meaning of the given descriptions. In this paper, our main purpose is to present a brief comparison of five different methods based on Generative Adversarial Networks (GANs) for generating images from text. Each model architecture synthesizes images at a different resolution; the lowest and highest resolutions obtained are 64$\times$64 and 256$\times$256, respectively. We also examine and compare several metrics that measure the accuracy of each model. By comparing the essential metrics of these approaches, we identify the best model for this problem.

Paper Structure

This paper contains 6 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Examples of generated images from the CUB-200 [2] and MSCOCO [3] datasets. The first row shows images from CUB-200-2011 and the second row shows images from MSCOCO.
  • Figure 2: Simple architecture of a DCGAN consisting of 5 deconvolutional layers in the generator and 5 convolutional layers in the discriminator, which generates a 64$\times$64$\times$3 image from a noise vector of dimension 100$\times$1.
  • Figure 3: Example images generated by each of the mentioned methods on the CUB-200-2011 dataset.
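
The DCGAN shapes described in Figure 2 can be checked with a small shape trace. This is only a sketch: the caption gives the layer counts, the 100$\times$1 noise input, and the 64$\times$64$\times$3 output, but not the per-layer settings. The kernel sizes, strides, paddings, and channel widths below are assumptions borrowed from the commonly used DCGAN configuration, and the names `GENERATOR_LAYERS`, `DISCRIMINATOR_LAYERS`, `trace_generator`, and `trace_discriminator` are hypothetical.

```python
# Shape trace for a DCGAN-style generator and discriminator.
# All layer hyperparameters are ASSUMED (common DCGAN settings);
# the source only states 5 layers each, 100-d noise, 64x64x3 output.

def deconv_out(size, kernel, stride, padding):
    """Side length after a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Side length after a standard convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (out_channels, kernel, stride, padding) for each of the 5 layers.
GENERATOR_LAYERS = [
    (512, 4, 1, 0),  # 1x1   -> 4x4
    (256, 4, 2, 1),  # 4x4   -> 8x8
    (128, 4, 2, 1),  # 8x8   -> 16x16
    (64, 4, 2, 1),   # 16x16 -> 32x32
    (3, 4, 2, 1),    # 32x32 -> 64x64 (RGB image)
]
DISCRIMINATOR_LAYERS = [
    (64, 4, 2, 1),   # 64x64 -> 32x32
    (128, 4, 2, 1),  # 32x32 -> 16x16
    (256, 4, 2, 1),  # 16x16 -> 8x8
    (512, 4, 2, 1),  # 8x8   -> 4x4
    (1, 4, 1, 0),    # 4x4   -> 1x1 (real/fake score)
]

def trace_generator(noise_dim=100):
    """Return the (channels, height, width) shape after each generator layer."""
    shapes = [(noise_dim, 1, 1)]  # noise vector viewed as a 1x1 feature map
    size = 1
    for channels, kernel, stride, padding in GENERATOR_LAYERS:
        size = deconv_out(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes

def trace_discriminator(image_size=64):
    """Return the (channels, height, width) shape after each discriminator layer."""
    shapes = [(3, image_size, image_size)]
    size = image_size
    for channels, kernel, stride, padding in DISCRIMINATOR_LAYERS:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes

if __name__ == "__main__":
    print("generator:    ", trace_generator())
    print("discriminator:", trace_discriminator())
```

Under these assumed settings, each stride-2 transposed convolution doubles the spatial side length, which is how five layers take a 1$\times$1 input to a 64$\times$64 image; the discriminator mirrors this by halving the side length at each stride-2 convolution.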