ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

Maitreya Patel, Changhoon Kim, Sheng Cheng, Chitta Baral, Yezhou Yang

TL;DR

This work tackles the high resource demands of unCLIP text-to-image (T2I) generation by introducing ECLIPSE, a CLIP-guided, non-diffusion T2I prior that distills knowledge from pre-trained vision-language models into a compact prior of 33–34M parameters trained on a fraction of the data. The method combines a projection objective with a CLIP-based contrastive loss to align the text and image latent spaces, which yields strong compositional capabilities with far less data and compute. Empirically, ECLIPSE approaches or matches the compositional performance of much larger models under resource constraints while using only about 2.8% of the training data and 3.3% of the parameters of conventional priors. The analyses indicate that diffusion priors and added noise can hinder performance, which underscores the practical value of non-diffusion priors for efficient text-to-image synthesis.

Abstract

Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks, at the cost of significant computational resources. The unCLIP stack comprises a T2I prior and a diffusion image decoder. The T2I prior model alone adds a billion parameters compared to Latent Diffusion Models, which increases both the computational requirements and the need for high-quality data. We introduce ECLIPSE, a novel contrastive learning method that is both parameter- and data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g., CLIP) to distill their knowledge into the prior model. We demonstrate that the ECLIPSE-trained prior, with only 3.3% of the parameters and trained on a mere 2.8% of the data, surpasses the baseline T2I priors with an average preference score of 71.6% in a resource-limited setting. It also attains performance on par with large SOTA models, achieving an average preference score of 63.36% in terms of the ability to follow text compositions. Extensive experiments on two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE priors consistently deliver high performance while significantly reducing resource dependency.
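
The two percentages quoted in the abstract can be sanity-checked against the absolute counts given in the Figure 3 caption (1 Billion and 33 Million parameters, 177 Million and 5 Million image-text pairs). The short sketch below does only that arithmetic. Pairing these particular counts with the abstract's percentages is an assumption made for illustration, and the variable names are hypothetical.

```python
# Counts as quoted in the Figure 3 caption; matching them to the abstract's
# 3.3% / 2.8% figures is an assumption of this sketch.
baseline_prior_params = 1_000_000_000  # "1 Billion"
eclipse_prior_params = 33_000_000      # "33 Million"
baseline_pairs = 177_000_000           # "177 Million" image-text pairs
eclipse_pairs = 5_000_000              # "5 Million" image-text pairs

# Express the ECLIPSE budget as a percentage of the baseline budget.
param_ratio = 100 * eclipse_prior_params / baseline_prior_params
data_ratio = 100 * eclipse_pairs / baseline_pairs

print(f"parameters: {param_ratio:.1f}% of the baseline prior")  # 3.3%
print(f"data:       {data_ratio:.1f}% of the baseline pairs")   # 2.8%
```

Under that reading, both ratios agree with the abstract to one decimal place.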

Paper Structure (23 sections, 5 equations, 21 figures, 4 tables)

Figures (21)

  • Figure 1: Comparison of SOTA text-to-image models with respect to their total number of parameters and their average performance on the three composition tasks (color, shape, and texture). ECLIPSE achieves better results with fewer parameters and without requiring a large amount of training data. The ECLIPSE model shown trains a T2I prior (having only 33M parameters) on only 5M image-text pairs with the Kandinsky decoder.
  • Figure 2: Standard T2I prior learning strategies (top) minimize the mean squared error between the predicted vision embedding $\hat{z}_x$ and the ground-truth embedding $z_x$, with or without time-conditioning. This approach does not generalize well outside the training distribution (such as Orange Square). The proposed ECLIPSE training methodology (bottom) uses contrastive learning to exploit the semantic alignment between $z_x$ and $z_y$, which improves the generalization of the text-to-image prior.
  • Figure 3: Qualitative results of our text-to-image prior, ECLIPSE, compared with SOTA T2I models. Our prior model reduces the model parameter requirements (from 1 Billion $\rightarrow$ 33 Million) and data requirements (from 177 Million $\rightarrow$ 5 Million $\rightarrow$ 0.6 Million). Given this restrictive setting, ECLIPSE performs close to its much larger counterpart (i.e., Kandinsky v2.2) and even outperforms models trained on huge datasets (i.e., Wurstchen, SDv1.4, and SDv2.1) in terms of compositions.
  • Figure 4: Qualitative evaluations by human preferences approximated by PickScore (Kirstain et al., 2023). The top two figures compare ECLIPSE to the Projection and Diffusion baselines trained with the same amount of data and the same model size for both the Karlo and Kandinsky decoders. In the bottom figure, we compare ECLIPSE with the Kandinsky v2.2 decoder, trained on the LAION-HighRes dataset, against SOTA models.
  • Figure 5: Empirical analysis of the PickScore preferences of diffusion priors with respect to the various hyper-parameters.
  • ...and 16 more figures
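
The Figure 2 caption contrasts a plain mean-squared-error objective on $\hat{z}_x$ and $z_x$ with a contrastive objective that ties the predicted image embedding to the text embedding $z_y$. The NumPy sketch below illustrates one way such a combined objective could look. It is not the authors' implementation: the InfoNCE-style form, the weight `lam`, the `temperature`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def projection_loss(z_hat_x, z_x):
    """Squared error between predicted and ground-truth image embeddings,
    summed over dimensions and averaged over the batch."""
    return float(np.mean(np.sum((z_hat_x - z_x) ** 2, axis=1)))

def contrastive_loss(z_hat_x, z_y, temperature=0.07):
    """InfoNCE-style loss: each predicted image embedding should be most
    similar to the text embedding at the same batch index."""
    # L2-normalise so that dot products are cosine similarities.
    a = z_hat_x / np.linalg.norm(z_hat_x, axis=1, keepdims=True)
    b = z_y / np.linalg.norm(z_y, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching (image, text) pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

def combined_loss(z_hat_x, z_x, z_y, lam=0.2, temperature=0.07):
    """Projection term plus a weighted contrastive term.
    `lam` and `temperature` are illustrative values, not taken from the text."""
    return projection_loss(z_hat_x, z_x) + lam * contrastive_loss(
        z_hat_x, z_y, temperature
    )

# Toy batch: image embeddings lie close to their paired text embeddings.
rng = np.random.default_rng(0)
z_y = rng.normal(size=(8, 16))
z_x = z_y + 0.01 * rng.normal(size=(8, 16))

aligned = combined_loss(z_x, z_x, z_y)           # predictions match their pairs
mismatched = combined_loss(z_x[::-1], z_x, z_y)  # predictions assigned to wrong pairs
print(aligned, mismatched)
```

On this toy batch the mismatched predictions incur a higher loss from both terms. The contrastive term is the one that uses $z_y$ directly, which is the alignment signal the caption credits for the improved generalization.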