Text-To-Image with Generative Adversarial Networks

Mehrshad Momen-Tayefeh

TL;DR

This work tackles the problem of generating realistic images from natural language descriptions by surveying five GAN-based text-to-image methods (e.g., GAN-CLS, StackGAN, AttnGAN, and SDN). It analyzes architectural differences, datasets, evaluation metrics, and output resolutions to compare performance across standard datasets like CUB-200-2011, Oxford-102, and MSCOCO. The study finds that AttnGAN often achieves the best Inception Score on MSCOCO, while SDN excels on simpler datasets, highlighting the value of attention mechanisms and multi-stage generation for fidelity. The results inform model selection and evaluation practices in text-to-image synthesis, emphasizing the practical impact of architectural choices on image realism and caption alignment.

Abstract

Generating realistic images from human-written text is one of the most challenging problems in computer vision (CV). Existing text-to-image approaches can roughly reflect the meaning of a given description. In this paper, our main purpose is to present a brief comparison of five different methods based on Generative Adversarial Networks (GANs) for generating images from text. Each model architecture synthesizes images at a different resolution; the worst and best resolutions obtained are 64×64 and 256×256, respectively. We also examine and compare metrics that indicate the accuracy of each model. By comparing these key metrics across the approaches, we identify the best-performing model for this problem.
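As an illustration of the kind of metric compared here, the Inception Score mentioned in the TL;DR is commonly defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's label distribution for a generated image and p(y) is the average of those distributions. The following is a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function name `inception_score` and its input format are hypothetical, and the class probabilities are taken as given rather than computed by a pretrained Inception network.

```python
import math


def inception_score(class_probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from per-image class probabilities.

    class_probs is a list of probability vectors, one per generated image.
    In practice these would come from a pretrained classifier (assumption:
    here they are supplied directly).
    """
    n = len(class_probs)
    k = len(class_probs[0])

    # Marginal label distribution p(y): average of p(y|x) over all images.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]

    # Sum of KL divergences between each conditional and the marginal.
    total_kl = 0.0
    for p in class_probs:
        for j in range(k):
            if p[j] > 0:  # terms with p[j] == 0 contribute nothing
                total_kl += p[j] * math.log(p[j] / marginal[j])

    return math.exp(total_kl / n)


# Confident and diverse predictions give a high score (upper bound = number of classes).
diverse = [[1.0, 0.0], [0.0, 1.0]]
# Uninformative predictions give the minimum score of 1.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(diverse), inception_score(uniform))
```

A higher score indicates that individual images are classified confidently while the set as a whole covers many classes. Published scores are usually averaged over several splits of a large sample, which this sketch omits.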
