Fine-grained Text to Image Synthesis

Xu Ouyang, Ying Chen, Kaiyue Zhu, Gady Agam

TL;DR

This work tackles fine-grained text-to-image synthesis by enhancing the RAT GAN with a discriminator-side auxiliary classifier and a contrastive learning objective using cross-batch memory. The auxiliary classifier provides category-level supervision and helps the generator produce finer-grained details, while the contrastive losses enforce intra-class similarity and inter-class separation across real and generated images. Empirical results on CUB-200-2011 and Oxford-102 show improved FID with competitive IS, demonstrating stronger realism and semantic fidelity at the fine-grained level with only a modest parameter increase. Overall, FG-RAT GAN presents an effective, scalable approach to incorporate fine-grained supervisory signals into GAN-based text-to-image synthesis, with practical implications for applications requiring detailed and class-consistent imagery.

Abstract

Fine-grained text to image synthesis involves generating images from texts that belong to different categories. In contrast to general text to image synthesis, in fine-grained synthesis there is high similarity between images of different subclasses, and there may be linguistic discrepancy among texts describing the same image. Recent Generative Adversarial Networks (GANs), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from texts. However, these GAN models ignore fine-grained information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator classify images by class, and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We compare against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and Oxford-102 flower dataset, and demonstrate superior performance.
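The contrastive objective described above can be illustrated with a small sketch. This is not the authors' implementation: the exact loss form is not given here, so the code assumes a margin-based pairwise loss on cosine similarity, and the names `CrossBatchMemory`, `pairwise_contrastive_loss`, `capacity`, and `margin` are hypothetical. It only shows the idea of pulling same-class embeddings together and pushing different-class embeddings apart, with a queue of embeddings from earlier batches serving as additional comparison targets.

```python
from collections import deque

import numpy as np


class CrossBatchMemory:
    """Fixed-size FIFO queue of (embedding, label) pairs from past batches (hypothetical helper)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entries once full
        self.items = deque(maxlen=capacity)

    def enqueue(self, embeddings, labels):
        for e, y in zip(embeddings, labels):
            self.items.append((np.array(e, dtype=np.float64), int(y)))

    def get(self):
        embs = np.stack([e for e, _ in self.items])
        labs = np.array([y for _, y in self.items])
        return embs, labs


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def pairwise_contrastive_loss(embeddings, labels, memory_embeddings, memory_labels, margin=0.5):
    """Assumed loss form: same-class pairs pay (1 - sim), other pairs pay max(sim - margin, 0)."""
    a = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    m = l2_normalize(np.asarray(memory_embeddings, dtype=np.float64))
    sim = a @ m.T  # cosine similarity, shape (batch, memory)
    same = np.asarray(labels)[:, None] == np.asarray(memory_labels)[None, :]
    pos_loss = (1.0 - sim)[same].sum()                     # low similarity within a class is penalized
    neg_loss = np.maximum(sim - margin, 0.0)[~same].sum()  # high similarity across classes is penalized
    return (pos_loss + neg_loss) / len(a)


# Usage: embeddings of real and generated images could both be enqueued,
# then each new batch is compared against everything in the memory.
memory = CrossBatchMemory(capacity=3)
memory.enqueue([[1.0, 0.0], [0.0, 1.0]], [0, 1])
mem_embs, mem_labels = memory.get()
loss = pairwise_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1], mem_embs, mem_labels)
```

In this toy usage the batch matches its own class in memory exactly and is orthogonal to the other class, so neither term contributes to the loss.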

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The original discriminator in Figure (a) computes the GAN loss. The discriminator with an auxiliary classifier in Figure (b) computes the categorical cross-entropy loss. The discriminator with contrastive learning in Figure (c) computes the contrastive learning loss.
  • Figure 2: The structure of the discriminator with auxiliary classifier and contrastive learning. The original output of the discriminator is still used to compute the GAN loss; it is also passed to one fully connected layer that decreases the feature dimension. The fully connected layer is followed by one embedding layer for contrastive learning, and the embedding layer is in turn followed by a classifier for image classification.
  • Figure 3: Examples of generated images using RAT GAN and the proposed FG-RAT GAN on the CUB bird dataset. Each row represents a different sample (image size=256x256), with the corresponding caption below. The first column is the image class and name. The second column is the corresponding target image. The remaining columns are the generated images from LAFITE, VQ-Diffusion, RAT GAN, and our FG-RAT GAN. As we can see, our FG-RAT GAN can generate more realistic images where each image is similar to other images within the same class.
  • Figure 4: Examples of generated images using RAT GAN and the proposed FG-RAT GAN with classifier and contrastive learning trained on the Oxford flower dataset. Each row represents a different sample (image size=256x256). The first column is the sample detail including class and specific image name. The second column is the caption. The third column is the corresponding target image. The fourth column is the image generated by RAT GAN. The fifth column is the image generated by our proposed FG-RAT GAN. As we can see, our proposed FG-RAT GAN can generate more realistic images where each image is similar to other images within the same class.
  • Figure 5: Examples of generated images using DALLE-2, Stable Diffusion, and the proposed FG-RAT GAN trained on the CUB bird dataset. Each row represents a different sample (image size=256x256). The first column is the sample detail including class and specific image name. The second column is the corresponding target image. The third column is a generated image from DALLE-2. The fourth column is a generated image from Stable Diffusion. The fifth column is a generated image from our proposed FG-RAT GAN. As we can see, our proposed FG-RAT GAN can generate more realistic images where each image is similar to other images within the same class. For example, in the 1st row the proposed FG-RAT GAN generates a bird with a dark brown body and a white band encircling the area near the bill, as specified in the caption; in the 3rd row it generates a bird with an all-gray body, as specified in the caption; and both examples are similar to each other given that they belong to the same class.
  • ...and 1 more figure
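
The head layout described in the Figure 2 caption (discriminator feature, then a fully connected layer, then an embedding layer, then a classifier) can be sketched as a plain forward pass. This is a minimal illustration under assumptions, not the paper's code: the class name `FineGrainedHeads`, all layer sizes, the ReLU after the fully connected layer, and the random initialization are hypothetical. The 200 output classes simply match the number of categories in CUB-200-2011.

```python
import numpy as np


class FineGrainedHeads:
    """Hypothetical heads attached to a discriminator feature vector."""

    def __init__(self, feat_dim, reduced_dim, embed_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Fully connected layer that decreases the feature dimension
        self.w_fc = rng.normal(0.0, 0.02, (feat_dim, reduced_dim))
        self.b_fc = np.zeros(reduced_dim)
        # Embedding layer whose output feeds the contrastive loss
        self.w_emb = rng.normal(0.0, 0.02, (reduced_dim, embed_dim))
        self.b_emb = np.zeros(embed_dim)
        # Classifier on top of the embedding, trained with cross entropy
        self.w_cls = rng.normal(0.0, 0.02, (embed_dim, num_classes))
        self.b_cls = np.zeros(num_classes)

    def forward(self, features):
        h = np.maximum(features @ self.w_fc + self.b_fc, 0.0)  # ReLU is an assumption
        emb = h @ self.w_emb + self.b_emb
        logits = emb @ self.w_cls + self.b_cls
        return emb, logits


def cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by the mean negative log-likelihood
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Usage with made-up sizes: 3 images, 16-dimensional discriminator features
heads = FineGrainedHeads(feat_dim=16, reduced_dim=8, embed_dim=4, num_classes=200)
emb, logits = heads.forward(np.ones((3, 16)))
cls_loss = cross_entropy(logits, np.array([0, 5, 199]))
```

In a full model the GAN loss would still be computed from the discriminator's original output, while `emb` and `logits` would feed the contrastive and classification losses respectively.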