A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces

Dominic Rampas, Pablo Pernias, Marc Aubreville

TL;DR

Paella tackles the high compute cost of text-to-image generation by introducing a convolutional, token-based generator that operates in a discretized latent space produced by a VQGAN with compression factor $f=4$. It employs a novel token noising and denoising training objective along with a 12-step iterative sampling scheme that uses classifier-free guidance and ByT5-XL plus CLIP conditioning to produce text- and image-conditioned outputs. The method achieves competitive zero-shot FID on COCO with a 1B-parameter model while using far fewer sampling steps than diffusion or transformer-based models, and it benefits from simple training and sampling paradigms that are easier to reproduce. The authors release the code and model weights under the MIT license, aiming to democratize access to high-quality text-to-image synthesis and enable downstream applications such as image variations and mixed conditioning.
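To make the token noising idea concrete, here is a minimal sketch in plain Python. It assumes that "noising" means replacing a fraction of the latent tokens with uniformly random codebook indices. The helper names (`latent_grid`, `noise_tokens`) and the codebook size in the usage lines are illustrative assumptions, not taken from the released code.

```python
import random

def latent_grid(height, width, f=4):
    """Spatial size of the token grid for an image compressed by factor f."""
    return height // f, width // f

def noise_tokens(tokens, t, codebook_size, rng):
    """Replace each token with a uniformly random codebook index with probability t.

    Returns the noised tokens and a per-position flag marking replaced positions.
    A replaced token can coincide with the original by chance.
    """
    noised, replaced = [], []
    for tok in tokens:
        if rng.random() < t:
            noised.append(rng.randrange(codebook_size))
            replaced.append(True)
        else:
            noised.append(tok)
            replaced.append(False)
    return noised, replaced

# Usage: a 256x256 image at f=4 maps to a 64x64 grid, i.e. 4096 tokens.
rng = random.Random(0)
h, w = latent_grid(256, 256)
clean = [rng.randrange(8192) for _ in range(h * w)]  # 8192 is an assumed codebook size
noised, replaced = noise_tokens(clean, t=0.5, codebook_size=8192, rng=rng)
```

During training, a model would receive `noised` and be optimized to predict `clean`, which is the denoising objective described above.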

Abstract

Recent advancements in the domain of text-to-image synthesis have culminated in a multitude of enhancements pertaining to quality, fidelity, and diversity. Contemporary techniques enable the generation of highly intricate visuals which rapidly approach near-photorealistic quality. Nevertheless, as progress is achieved, the complexity of these methodologies increases, consequently intensifying the comprehension barrier between individuals within the field and those external to it. In an endeavor to mitigate this disparity, we propose a streamlined approach for text-to-image generation, which encompasses both the training paradigm and the sampling process. Despite its remarkable simplicity, our method yields aesthetically pleasing images with few sampling iterations, allows for intriguing ways for conditioning the model, and imparts advantages absent in state-of-the-art techniques. To demonstrate the efficacy of this approach in achieving outcomes comparable to existing works, we have trained a one-billion parameter text-conditional model, which we refer to as "Paella". In the interest of fostering future exploration in this field, we have made our source code and models publicly accessible for the research community.

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Visual results of our proposed method and trained model, which can perform a variety of image synthesis tasks. The left-hand side and the top right show the model's text-conditional image generation at different sizes, while the two bottom-right panels show image conditioning and combined text-and-image conditioning.
  • Figure 2: Visual depiction of the overall architecture of our proposed method. Training of Paella operates on a compressed latent space. Latent images are noised and the model is optimized to predict the unnoised version of the image.
  • Figure 3: Sampling mechanism for the token predictor of our model.
  • Figure 4: a) Illustrative comparison of single-step argmax denoising from masked tokens versus from random-noise tokens. Masked tokens (top row) always yield the same output, whereas random tokens (bottom row) yield different outputs. This shows that our method induces diversity intrinsically, while a masked setting must induce it during sampling. b) Comparison of low-confidence renoising as used in MUSE (top) with our proposed random renoising (bottom).
  • Figure 5: Dependence of a) the zero-shot Fréchet Inception Distance (FID) score (Heusel et al., 2017) and b) the CLIP score on the number of sampling steps.
  • ...and 1 more figure
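
The sampling mechanism referred to in Figures 3 and 4 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for the token predictor, and the linear renoising schedule and the guidance weight `w=4.0` are chosen only for illustration. It shows the three ingredients named in the text: starting from random tokens, classifier-free guidance at each step, and random (confidence-independent) renoising between steps.

```python
import math
import random

def guide(cond_logits, uncond_logits, w):
    """Classifier-free guidance: extrapolate from unconditional toward conditional logits."""
    return [u + w * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def sample_token(logits, rng):
    """Draw one index from softmax(logits)."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

def renoise_randomly(tokens, ratio, codebook_size, rng):
    """Random renoising: re-randomize each position with probability `ratio`,
    regardless of how confident the model was at that position."""
    return [rng.randrange(codebook_size) if rng.random() < ratio else t
            for t in tokens]

def sample(model, cond, n_tokens, codebook_size, steps=12, w=4.0, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every position holds a random codebook index.
    tokens = [rng.randrange(codebook_size) for _ in range(n_tokens)]
    for i in range(steps):
        cond_logits = model(tokens, cond)
        uncond_logits = model(tokens, None)
        tokens = [sample_token(guide(c, u, w), rng)
                  for c, u in zip(cond_logits, uncond_logits)]
        if i < steps - 1:
            ratio = 1.0 - (i + 1) / steps  # assumed linear schedule
            tokens = renoise_randomly(tokens, ratio, codebook_size, rng)
    return tokens

def toy_model(tokens, cond, k=8):
    """Hypothetical predictor: flat logits when unconditional,
    a strong preference for token `cond` when conditioned."""
    if cond is None:
        return [[0.0] * k for _ in tokens]
    return [[20.0 if j == cond else 0.0 for j in range(k)] for _ in tokens]

result = sample(toy_model, cond=3, n_tokens=16, codebook_size=8)
```

By contrast, the low-confidence renoising used in MUSE would choose which positions to re-randomize based on the model's predicted probabilities rather than uniformly at random.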