GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu

TL;DR

GALIP (Generative Adversarial CLIPs) is a text-to-image framework that integrates the pretrained CLIP model into both the discriminator and the generator. According to the abstract, it achieves results comparable to large pretrained autoregressive and diffusion models while using about 3% of the training data and 6% of the learnable parameters, and it synthesizes images about 120 times faster. GALIP keeps CLIP-ViT frozen and adds dedicated Mate-D and Mate-G modules, with a bridge feature predictor and a prompt predictor on the generator side. This design delivers strong image quality and domain generalization while preserving the smooth latent space of a GAN, which makes the style of the output easier to control. Quantitative and qualitative results on CUB, COCO, CC3M, and CC12M show competitive or better performance against large autoregressive and diffusion models. The work shows that understanding and generation models can be combined productively, and it points toward compact, versatile large-scale models that couple perception with synthesis.

Abstract

Synthesizing high-fidelity complex images from text is challenging. Based on large pretraining, the autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, there remain three flaws. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows the image synthesis process heavily. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model both in the discriminator and generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to accurately assess the image quality. Furthermore, we propose a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency, and as a result, our model only requires about 3% training data and 6% learnable parameters, achieving comparable results to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space from GAN. The extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at https://github.com/tobran/GALIP.
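
The abstract states that GALIP inherits the smooth latent space of GANs. A common way to probe that property, for any GAN, is to interpolate linearly between two noise vectors while the text prompt is held fixed and to check that the generated images change gradually. The sketch below shows only the interpolation step under stated assumptions: the latent dimension of 100, the helper names, and the number of steps are illustrative choices, not values taken from the paper, and no GALIP generator is called.

```python
import random


def sample_latent(dim, rng):
    """Draw one latent vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lerp(z0, z1, t):
    """Linearly interpolate between z0 and z1; t=0 gives z0, t=1 gives z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolation_path(z0, z1, steps):
    """Return `steps` evenly spaced latent vectors from z0 to z1 inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


# Assumed latent size (hypothetical, chosen for illustration only).
LATENT_DIM = 100

rng = random.Random(0)
z_start = sample_latent(LATENT_DIM, rng)
z_end = sample_latent(LATENT_DIM, rng)
path = interpolation_path(z_start, z_end, steps=5)
```

Each vector in `path` would be passed, together with the same text embedding, to a trained generator. In a smooth latent space the resulting images are expected to morph gradually from the first to the last rather than jump between unrelated outputs.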

Paper Structure (13 sections, 1 equation, 9 figures, 3 tables)

Figures (9)

  • Figure 1: (a) Existing text-to-image GANs conduct adversarial training from scratch. (b) Our proposed GALIP conducts adversarial training based on the integrated CLIP model.
  • Figure 2: Compared with Latent Diffusion Models (LDM; Rombach et al., 2022), our GALIP achieves a comparable zero-shot Fréchet Inception Distance (ZS-FID) with only 320M parameters (0.08B trainable parameters + 0.24B frozen CLIP parameters) and 12M training images. Furthermore, our GALIP requires only 0.04s to synthesize one image, which is $\sim$120$\times$ faster than LDM. Speed is measured on an NVIDIA 3090 GPU and an Intel Xeon Silver 4314 CPU.
  • Figure 3: The architecture of the proposed GALIP for text-to-image synthesis. Armed with the CLIP-based discriminator and CLIP-empowered generator, our model can synthesize more realistic complex images.
  • Figure 4: The architecture of the proposed Mate-D for text-to-image synthesis. It further extracts informative visual features from collected CLIP features and assesses the image quality more accurately.
  • Figure 5: The architecture of the proposed CLIP-empowered generator for text-to-image synthesis. Armed with a bridge feature predictor and a prompt predictor, it can induce meaningful visual concepts from the frozen CLIP-ViT for image synthesis.
  • ...and 4 more figures
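
The numbers quoted in the Figure 2 caption can be cross-checked with a few lines of arithmetic. The sketch below uses only the caption's figures (0.08B trainable parameters, 0.24B frozen CLIP parameters, 0.04s per image, roughly 120 times faster than LDM). The variable names are illustrative, and the LDM time per image is merely implied by those figures; it is not a measurement reported in the caption.

```python
# Figures quoted in the Figure 2 caption.
TRAINABLE_PARAMS_B = 0.08       # billions of trainable parameters
FROZEN_CLIP_PARAMS_B = 0.24     # billions of frozen CLIP parameters
GALIP_SECONDS_PER_IMAGE = 0.04  # synthesis time for one image
SPEEDUP_OVER_LDM = 120          # approximate speedup stated in the caption

# Total parameter count in millions (the caption says 320M).
total_params_m = round((TRAINABLE_PARAMS_B + FROZEN_CLIP_PARAMS_B) * 1000)

# Share of GALIP's own parameters that are trained rather than frozen.
trainable_share = TRAINABLE_PARAMS_B / (TRAINABLE_PARAMS_B + FROZEN_CLIP_PARAMS_B)

# Throughput of GALIP, and the LDM time implied by the stated speedup
# (derived here, not reported in the caption).
images_per_second = 1.0 / GALIP_SECONDS_PER_IMAGE
implied_ldm_seconds = GALIP_SECONDS_PER_IMAGE * SPEEDUP_OVER_LDM

print(f"total parameters: {total_params_m}M")
print(f"trainable share of GALIP's parameters: {trainable_share:.0%}")
print(f"GALIP throughput: {images_per_second:.0f} images/s")
print(f"implied LDM time per image: {implied_ldm_seconds:.1f}s")
```

Note that the trainable share computed here is relative to GALIP's own 320M parameters. It is a different quantity from the "6% learnable parameters" in the abstract, which compares GALIP against the large pretrained models.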