TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network

Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, Muhammad Zeshan Afzal

TL;DR

TAC-GAN addresses text-to-image synthesis by extending AC-GAN: the generator is conditioned on a text embedding instead of a class label, and the discriminator also receives the text and still predicts the image class as an auxiliary task. Using Skip-Thought embeddings on the Oxford-102 Flowers dataset, it reports an Inception Score of 3.45 (a relative increase of 7.8% over StackGAN) and a mean MS-SSIM of 0.14 across generated classes, which indicates high sample diversity. Interpolating in the noise space changes the style of the generated image while the content stays roughly fixed, and interpolating in the text-embedding space changes the content. The approach can be extended to additional conditioning information and could be combined with multi-stage refinement pipelines.

Abstract

In this work, we present the Text Conditioned Auxiliary Classifier Generative Adversarial Network (TAC-GAN), a text-to-image Generative Adversarial Network (GAN) for synthesizing images from their text descriptions. Former approaches have tried to condition the generative process on textual data, but combining this with the use of class information, which is known to diversify the generated samples and improve their structural coherence, has not been explored. We trained the presented TAC-GAN model on the Oxford-102 dataset of flowers and evaluated the discriminability of the generated images with the Inception Score, as well as their diversity using the Multi-Scale Structural Similarity Index (MS-SSIM). Our approach outperforms the state-of-the-art models: its Inception Score is 3.45, a relative increase of 7.8% over the recently introduced StackGAN. A comparison of the mean MS-SSIM scores of the training and generated samples per class shows that our approach is able to generate highly diverse images, with an average MS-SSIM of 0.14 over all generated classes.
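The relative-increase figure in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the StackGAN baseline is not stated in the abstract, so it is back-computed here from the two numbers that are (3.45 and 7.8%).

```python
def relative_increase(new: float, baseline: float) -> float:
    """Relative increase of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def implied_baseline(new: float, rel_increase: float) -> float:
    """Baseline score implied by a new score and a relative increase."""
    return new / (1.0 + rel_increase)


# Values taken from the abstract.
tac_gan_score = 3.45
reported_increase = 0.078

# Derived value (an inference, not a number quoted in the abstract).
baseline = implied_baseline(tac_gan_score, reported_increase)
print(f"implied baseline Inception Score: {baseline:.2f}")

# Round trip: a baseline of 3.20 gives back roughly the reported 7.8%.
print(f"relative increase over 3.20: {relative_increase(tac_gan_score, 3.20):.1%}")
```

Under these assumptions the implied baseline comes out at about 3.20, which is consistent with the 7.8% figure after rounding.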

Paper Structure

This paper contains 17 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Images generated by the TAC-GAN given text descriptions. The texts on the left were used to generate the images on the right. The highlighted image on the right is a real image corresponding to the text description.
  • Figure 2: The architecture of the TAC-GAN. Here, $t$ is a text description of an image, $z$ is a noise vector of size $N_z$, $I_{real}$ and $I_{wrong}$ are the real and wrong images, respectively, $I_{fake}$ is the image synthesized by the generator network $G$, $\Psi(t)$ is the text embedding of size $N_t$ for the text $t$, and $C_r$ and $C_w$ are the one-hot encoded class labels of $I_{real}$ and $I_{wrong}$, respectively. $L_G$ and $L_D$ are two neural networks that each generate a latent representation of size $N_l$ for the text embedding $\Psi(t)$. $D_S$ and $D_C$ are the probability distributions that the Discriminator outputs over the sources (real/fake) and the classes, respectively.
  • Figure 3: Images synthesized from text descriptions using different noise vectors. In each block, the images at the bottom are generated from the text embedding of the image description and a noise vector. The images at the top of each block are real images corresponding to the text description.
  • Figure 4: For each block, two noise vectors $\textbf{z}_1$ and $\textbf{z}_2$ are generated. They are used to synthesize the images at the two extremes. For the images in between, an interpolation between the two vectors is used. The text embedding used to produce the images is the same for the entire block. It is produced from the textual description on the left. For comparison, the Ground Truth image is highlighted. As can be seen, the style of the synthesized images changes, but the content remains roughly the same, following the text input.
  • Figure 5: All images generated in the first and second row use the same noise vector $\textbf{z}_1$. Similarly, all images generated in the third and fourth rows use the same noise vector $\textbf{z}_2$. The first image of the first row and the last image of the second row use a text embedding that was constructed from the captions 1 and 2, respectively. An interpolation between these two embeddings was used for synthesizing all images in between them. The first image of the third row and the last image of the fourth row use a text embedding constructed from the captions 3 and 4. An interpolation between these two embeddings was used for synthesizing all images in between them. Notice that captions 2 and 3 are the same, but we use a different noise vector to generate the two different outputs.
  • ...and 2 more figures
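
The captions of Figures 2, 4 and 5 describe how the generator is fed and how the noise and text spaces are interpolated. The NumPy sketch below illustrates those shapes only, under several assumptions that are not in the captions: the sizes `N_z`, `N_t` and `N_l` are placeholder values, `latent_text` is a hypothetical stand-in for $L_G$ (one random linear layer with a leaky ReLU), the generator input is taken to be the concatenation of $z$ with $L_G(\Psi(t))$, and the interpolation is assumed to be linear. No image is generated; the generator network $G$ itself is out of scope here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: the captions name N_z, N_t and N_l but give no values.
N_z, N_t, N_l = 100, 300, 128

# Hypothetical stand-in for L_G: a single random linear layer.
W_G = rng.normal(scale=0.01, size=(N_t, N_l))


def latent_text(psi_t: np.ndarray) -> np.ndarray:
    """Map a text embedding Psi(t) of size N_t to a latent vector of size N_l."""
    h = psi_t @ W_G
    return np.where(h > 0, h, 0.2 * h)  # leaky ReLU (assumption)


def generator_input(z: np.ndarray, psi_t: np.ndarray) -> np.ndarray:
    """Assumed generator input: noise z concatenated with L_G(Psi(t))."""
    return np.concatenate([z, latent_text(psi_t)])


def interpolate(a: np.ndarray, b: np.ndarray, steps: int) -> np.ndarray:
    """Linear interpolation from a to b, endpoints included (assumption)."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - alpha) * a + alpha * b for alpha in alphas])


# Figure 4 style: fixed text embedding, noise interpolated between z1 and z2.
psi = rng.normal(size=N_t)  # random stand-in for a real text embedding
z1, z2 = rng.normal(size=N_z), rng.normal(size=N_z)
style_sweep = np.stack(
    [generator_input(z, psi) for z in interpolate(z1, z2, steps=8)]
)

# Figure 5 style: fixed noise z1, text embedding interpolated between captions.
psi_a, psi_b = rng.normal(size=N_t), rng.normal(size=N_t)
content_sweep = np.stack(
    [generator_input(z1, p) for p in interpolate(psi_a, psi_b, steps=8)]
)

print(style_sweep.shape, content_sweep.shape)
```

In the first sweep only the noise part of each input row changes, which corresponds to the style variation described for Figure 4. In the second sweep only the text part changes, which corresponds to the content variation described for Figure 5.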