RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

Chengde Lin, Xijun Lu, Guangxi Chen

TL;DR

The paper addresses the difficulty of generating text-consistent, high-fidelity images by integrating CLIP guidance with a recurrent, text-aware fusion mechanism. It introduces Recurrent Affine Transformations (RAT) implemented with LSTM cells to propagate global textual information across fusion blocks, and interleaves Shuffle Attention (SA) between RAT blocks to slow forgetting and stabilize memory. A CLIP-based discriminator and frozen CLIP-ViT features guide the synthesis, yielding strong cross-modal supervision. Across CUB, Oxford, and CelebA-tiny, RATLIP achieves state-of-the-art CLIP-Score and competitive or superior FID, with ablations confirming the contributions of RAT and SA to performance gains.

Abstract

Synthesizing high-quality, photorealistic images conditioned on textual descriptions is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT with a recurrent neural network, yielding recurrent affine transformations (RAT), to ensure that different layers can access global information. We then introduce shuffle attention between RAT blocks to mitigate the information forgetting characteristic of recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is available at https://github.com/OxygenLu/RATLIP.
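
To make the recurrent affine idea concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes that a single LSTM cell is stepped once per fusion layer with the text embedding as input, and that each layer's hidden state is mapped by a linear head to per-channel scale (gamma) and shift (beta) parameters. The class names `LSTMCell` and `RecurrentAffine`, all dimensions, and the random weights are hypothetical and chosen only for illustration; convolutions, upsampling, and shuffle attention between layers are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMCell:
    """Plain LSTM cell (hypothetical helper for this sketch)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # One weight matrix covering the input, forget, candidate, and output gates.
        self.W = rng.normal(0.0, 0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def __call__(self, x, h, c):
        z = np.concatenate([x, h], axis=-1) @ self.W + self.b
        i, f, g, o = np.split(z, 4, axis=-1)
        c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_next = sigmoid(o) * np.tanh(c_next)
        return h_next, c_next


class RecurrentAffine:
    """Text-conditioned affine modulation whose parameters come from a shared LSTM."""

    def __init__(self, text_dim, hidden_dim, channels_per_layer, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        self.cell = LSTMCell(text_dim, hidden_dim, rng)
        # Assumption: one pair of linear heads (gamma, beta) per fusion layer.
        self.heads = [
            (rng.normal(0.0, 0.1, size=(hidden_dim, ch)),
             rng.normal(0.0, 0.1, size=(hidden_dim, ch)))
            for ch in channels_per_layer
        ]

    def __call__(self, features, text):
        batch = text.shape[0]
        h = np.zeros((batch, self.hidden_dim))
        c = np.zeros((batch, self.hidden_dim))
        outputs = []
        for feat, (w_gamma, w_beta) in zip(features, self.heads):
            # The hidden and cell states carry text information from earlier layers.
            h, c = self.cell(text, h, c)
            gamma = (h @ w_gamma)[:, :, None, None]  # shape (B, C, 1, 1)
            beta = (h @ w_beta)[:, :, None, None]
            outputs.append(gamma * feat + beta)      # channel-wise affine transform
        return outputs


rng = np.random.default_rng(1)
text = rng.normal(size=(2, 8))                       # stand-in for a text embedding
features = [rng.normal(size=(2, ch, 5, 5)) for ch in (4, 4, 2)]
rat = RecurrentAffine(text_dim=8, hidden_dim=16, channels_per_layer=[4, 4, 2])
outputs = rat(features, text)
print([o.shape for o in outputs])
```

In contrast to independent per-layer predictors, every layer here reads from the same recurrent state, which is one way to give later layers access to what earlier layers already encoded about the text.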

Paper Structure (16 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The proposed general framework, RATLIP, for text-to-image synthesis. By integrating RATBLK and SAConv into the generator, our model can effectively understand the context.
  • Figure 2: RAT-Block structure. (a) DFBlock in Baseline (b) RAT Block, where the L-module is an LSTM cell. (c) LSTM Cell.
  • Figure 3: CAM visualizations of images synthesized on the CUB dataset by the LSTM with different hidden layers h. The sequence starts from the ground truth (h=0) and ends at h=128.
  • Figure 4: Visual comparison with current state-of-the-art models on the CelebA-tiny, Oxford, and CUB datasets.
  • Figure 5: Qualitative analysis of semantic spatial features. The face images in (a), (b), (c), and (d) are the ground truth, the baseline, Ours (young), and Ours (old). On the right is a latent space containing vectors that represent four image features. The rule is that adding vectors together yields a vector representing another feature.