A Framework For Image Synthesis Using Supervised Contrastive Learning

Yibin Liu, Jianyu Zhang, Li Zhang, Shijian Li, Gang Pan

TL;DR

This work addresses a limitation of typical text-to-image GANs: they focus on inter-modal text-image alignment and neglect the inner-modal semantic distribution described by labels. It introduces a label-guided supervised contrastive learning framework with two parameter-sharing branches that operate in both the pre-training and GAN training phases, leveraging both inter- and inner-modal correspondence. The approach yields substantial gains on CUB and COCO across AttnGAN, DM-GAN, SSA-GAN, and GALIP, with FID improvements on COCO of up to 30.1% and IS improvements on the simpler single-object dataset. The method outperforms existing label-guided alternatives such as UniCL and cross-entropy, and it is compatible with multiple T2I GAN architectures, suggesting wide applicability and potential for future extension to diffusion models.

Abstract

Text-to-image (T2I) generation aims to produce realistic images that correspond to text descriptions. Generative Adversarial Networks (GANs) have proven successful at this task. Typical T2I GANs are two-phase methods that first pre-train an inter-modal representation from aligned image-text pairs and then train an image generator with a GAN on that basis. However, such a representation ignores inner-modal semantic correspondence, e.g. between images with the same label. The semantic label describes a priori the inherent distribution pattern with underlying cross-image relationships, which supplements the text description in capturing the full characteristics of an image. In this paper, we propose a framework that leverages both inter- and inner-modal correspondence through label-guided supervised contrastive learning. We extend T2I GANs with two parameter-sharing contrast branches in both the pre-training and generation phases. This integration effectively clusters the representations of semantically similar image-text pairs, thereby fostering the generation of higher-quality images. We demonstrate our framework on four novel T2I GANs using both the single-object dataset CUB and the multi-object dataset COCO, achieving significant improvements in the Inception Score (IS) and Fréchet Inception Distance (FID) image generation metrics. Notably, on the more complex multi-object COCO, our framework improves FID by 30.1%, 27.3%, 16.2% and 17.1% for AttnGAN, DM-GAN, SSA-GAN and GALIP, respectively. We also validate the superiority of our approach by comparing it with other label-guided T2I GANs. The results affirm the effectiveness and competitiveness of our approach in advancing state-of-the-art GANs for T2I generation.
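
To make the idea of label-guided supervised contrastive learning concrete, the following is a minimal sketch of a generic supervised contrastive loss in plain Python. It is not the paper's exact formulation, which is not reproduced here: the function names, the temperature value, and the choice to treat every image and text embedding that shares a label as a positive are assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, hypothetical helper).

    embeddings: list of vectors (image and text representations mixed together)
    labels:     one class label per embedding
    Embeddings with the same label are pulled together; all others are pushed apart.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner contributes nothing
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in logits.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    # Two hypothetical quadruples (image_a, text_a, image_b, text_b), one per label.
    quadruple_0 = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05]]
    quadruple_1 = [[0.0, 1.0], [0.1, 0.9], [0.1, 1.0], [0.05, 0.95]]
    loss = supervised_contrastive_loss(quadruple_0 + quadruple_1, [0] * 4 + [1] * 4)
    print(f"loss = {loss:.4f}")
```

In this sketch the loss is low when representations with the same label already form a tight cluster and high when same-label representations are scattered among other classes, which is the clustering behaviour the abstract describes.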

Paper Structure

This paper contains 22 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Pre-training phase. Our data sampling strategy initiates two contrast branches with shared parameters that separately encode image-text pairs with the same label. The original loss is that of the method to which our framework is applied. The supervised contrastive loss operates on the quadruple of image and text representations from both branches.
  • Figure 2: GAN training phase. As in the pre-training phase, we use two parameter-sharing T2I GAN branches to contrast text-image pairs that share the same label. The supervised contrastive loss is computed on the quadruple of text and generated (fake) image representations from the two branches. In this phase, the pre-trained encoders are inference-only.
  • Figure 3: Qualitative comparison on the CUB and COCO datasets for the DM-GAN and SSA-GAN baselines with and without our framework (its use is denoted as "+SCL"). The input text descriptions are given in the first row, and the corresponding images generated by the different methods are shown in the same column. The left four columns are from CUB and the right four columns are from COCO.
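
The data sampling strategy mentioned in the Figure 1 caption, which feeds each of the two branches an image-text pair with the same label, could be sketched as follows. The source does not specify how partners are drawn, so uniform sampling, the single-label-per-sample assumption, the fallback when a label has only one sample, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def build_label_index(labels):
    """Map each label to the list of dataset indices that carry it."""
    index = defaultdict(list)
    for i, label in enumerate(labels):
        index[label].append(i)
    return index


def sample_same_label_pairs(labels, batch_size, rng=None):
    """Draw (anchor, partner) index pairs whose samples share a label.

    The anchor would feed the first contrast branch and the partner the second.
    Assumes one label per sample; partners are drawn uniformly (an assumption).
    """
    rng = rng or random.Random()
    index = build_label_index(labels)
    anchors = rng.sample(range(len(labels)), batch_size)
    pairs = []
    for anchor in anchors:
        candidates = [j for j in index[labels[anchor]] if j != anchor]
        # Fall back to the anchor itself when its label has no other sample.
        partner = rng.choice(candidates) if candidates else anchor
        pairs.append((anchor, partner))
    return pairs


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1, 2, 2, 2]  # hypothetical class label per image-text pair
    for anchor, partner in sample_same_label_pairs(toy_labels, 4, random.Random(0)):
        print(anchor, partner, toy_labels[anchor])
```

Each returned pair indexes two image-text pairs, so encoding both yields the quadruple of image and text representations on which the supervised contrastive loss operates.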