Fine-grained Text to Image Synthesis

Xu Ouyang; Ying Chen; Kaiyue Zhu; Gady Agam

TL;DR

This work tackles fine-grained text-to-image synthesis by enhancing the RAT GAN with a discriminator-side auxiliary classifier and a contrastive learning objective using cross-batch memory. The auxiliary classifier provides category-level supervision and helps the generator produce finer-grained details, while the contrastive losses enforce intra-class similarity and inter-class separation across real and generated images. Empirical results on CUB-200-2011 and Oxford-102 show improved FID with competitive IS, demonstrating stronger realism and semantic fidelity at the fine-grained level with only a modest parameter increase. Overall, FG-RAT GAN presents an effective, scalable approach to incorporate fine-grained supervisory signals into GAN-based text-to-image synthesis, with practical implications for applications requiring detailed and class-consistent imagery.
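The cross-batch memory mentioned above can be pictured as a fixed-size FIFO queue of past image embeddings and their class labels, so that each new batch is compared against more samples than fit in a single batch. The following is a minimal sketch under that assumption. The class name `CrossBatchMemory`, its `capacity` parameter, and the method names are hypothetical and are not taken from the FG-RAT GAN implementation.

```python
from collections import deque


class CrossBatchMemory:
    """Fixed-size FIFO store of (embedding, label) pairs from earlier batches.

    Hypothetical helper for illustration only; not the authors' code.
    """

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entries when full.
        self.items = deque(maxlen=capacity)

    def enqueue(self, embeddings, labels):
        """Append one batch of embeddings together with their class labels."""
        for emb, lab in zip(embeddings, labels):
            self.items.append((list(emb), lab))

    def split_by_label(self, label):
        """Return (positives, negatives) relative to the given class label."""
        positives = [e for e, l in self.items if l == label]
        negatives = [e for e, l in self.items if l != label]
        return positives, negatives


memory = CrossBatchMemory(capacity=4)
memory.enqueue([[1.0, 0.0], [0.0, 1.0]], labels=[3, 7])
memory.enqueue([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]], labels=[3, 7, 9])
# Five entries were added to a memory of capacity 4, so the oldest one
# (label 3, embedding [1.0, 0.0]) has been evicted.
positives, negatives = memory.split_by_label(3)
```

In this picture, the stored positives and negatives would serve as extra comparison targets for the contrastive losses, for both real and generated images.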

Abstract

Fine-grained text to image synthesis involves generating images from texts that belong to different categories. In contrast to general text to image synthesis, fine-grained synthesis must cope with high similarity between images of different subclasses and with linguistic discrepancies among texts describing the same image. Recent Generative Adversarial Networks (GANs), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from texts. However, these GAN models ignore fine-grained information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator predict the class of each image, and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We evaluate our approach against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and Oxford-102 flower dataset, and demonstrate superior performance.
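To make the contrastive objective concrete, here is a minimal sketch of a pairwise loss over image embeddings using cosine similarity: same-subclass pairs are penalized for low similarity, and different-subclass pairs are penalized for similarity above a margin. The exact loss used in the paper is not stated in this abstract, so the margin form, the function names, and the `margin` value below are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_contrastive_loss(embeddings, labels, margin=0.5):
    """Average pairwise loss (assumed form, not the paper's exact objective).

    Same-label pairs contribute (1 - similarity), pulling them together.
    Different-label pairs contribute max(0, similarity - margin), pushing
    them apart until their similarity drops below the margin.
    """
    total, count = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim
            else:
                total += max(0.0, sim - margin)
            count += 1
    return total / count if count else 0.0


# Two identical embeddings of subclass 0 and one orthogonal embedding of
# subclass 1: every pair already satisfies the objective.
separated = pairwise_contrastive_loss(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1]
)
# Two identical embeddings with different subclass labels are penalized.
collapsed = pairwise_contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [0, 1])
```

With the assumed margin of 0.5, the well-separated example incurs no loss, while the collapsed pair is penalized by the amount its similarity exceeds the margin. In the described approach, a term of this kind would be applied to both real and generated images alongside the adversarial and auxiliary classification losses.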
