CLIP-VQDiffusion: Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model

Seungdae Han, Joohee Kim

TL;DR

CLIP-VQDiffusion tackles the data bottleneck in text-conditioned image generation by enabling language-free training that leverages CLIP's multimodal embedding space and a vector-quantized diffusion model. The approach trains a VQ-GAN-based visual tokenizer and a diffusion decoder conditioned on CLIP image embeddings, which are perturbed with Gaussian noise during training to form pseudo text embeddings. At inference, CLIP text embeddings replace the image embeddings as the condition, which is incorporated through adaptive layer normalization. On FFHQ, it outperforms previous language-free methods by about 4.4% in CLIPScore and produces realistic images for both in-distribution and out-of-distribution prompts; COCO results are competitive with non-language-free baselines. The work reduces the need for paired text-image data and broadens the applicability of text-to-image generation, with pretrained models and code to follow.

Abstract

There has been significant progress in text-conditional image generation models. Recent advances in this field depend not only on improvements in model structures, but also on vast quantities of text-image paired data. However, creating such datasets is very costly and requires a substantial amount of labor. Well-known face datasets do not have corresponding text captions, making it difficult to develop text-conditional image generation models on them. Some research has focused on developing text-to-image generation models using only images without text captions. Here, we propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in CLIPScore and generated very realistic images for both in-distribution and out-of-distribution text. The pretrained models and code will soon be available at https://github.com/INFINIQ-AI1/CLIPVQDiffusion

Paper Structure

This paper contains 24 sections, 8 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: An overview of our CLIP-VQDiffusion approach. At the training stage, we embed the input image into the CLIP image embedding space and also obtain a clean latent code using the image tokenizer. Conditioned on the image embedding, our vector quantized diffusion model restores the noisy latent code to the clean latent code. At the inference stage, the CLIP text embedding is used instead of the image embedding as the condition of our diffusion model to generate the corresponding latent code.
  • Figure 2: Structure of our model. In the pretraining step, we train the visual tokenizer using the Gumbel softmax training method. In the training step, we encode the image using the visual tokenizer to get the image latent code. The Diffusion Image Decoder then learns to predict the image latent given the noisy image latent, the timestep, and the CLIP image embedding. At the inference step, we generate the image latent from a fully masked latent code, using the CLIP text embedding.
  • Figure 3: Structure of the transformer block incorporating the CLIP embedding.
  • Figure 4: Explanation of the pseudo text embedding. Since the CLIP image embedding $f_{\text{img}}(x)$ and the text embedding $f_{\text{txt}}(t)$ are located far from each other, we add Gaussian noise to $f_{\text{img}}(x)$ to obtain a pseudo text embedding $h'$ in the training step.
  • Figure 5: Sample images generated by our method, clip2latent, Lafite, and Clipgen. Our method produces high-quality samples with fine detail. All images were generated with the fixed text prompt "A photograph of".
  • ...and 4 more figures
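
The pseudo text embedding described in the Figure 4 caption can be illustrated with a short sketch. The captions only state that Gaussian noise is added to $f_{\text{img}}(x)$ to obtain $h'$; the norm-scaled perturbation followed by re-normalization used here is one form seen in earlier language-free work and is an assumption, not necessarily the exact formula of this paper. The helper name `pseudo_text_embedding`, the `noise_scale` parameter and its value, and the random stand-in for real CLIP embeddings are all hypothetical.

```python
import numpy as np


def pseudo_text_embedding(img_emb, noise_scale=0.25, rng=None):
    """Perturb CLIP image embeddings with Gaussian noise (hypothetical helper).

    Assumed form: h~ = f_img + noise_scale * eps * ||f_img|| / ||eps||,
    then h' = h~ / ||h~||. The exact form used in the paper may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    img_emb = np.asarray(img_emb, dtype=np.float64)

    # Sample isotropic Gaussian noise with the same shape as the embeddings.
    eps = rng.standard_normal(img_emb.shape)

    # Rescale the noise so its norm is noise_scale times the embedding norm.
    img_norm = np.linalg.norm(img_emb, axis=-1, keepdims=True)
    eps_norm = np.linalg.norm(eps, axis=-1, keepdims=True)
    perturbed = img_emb + noise_scale * eps * img_norm / eps_norm

    # Project back onto the unit sphere, where CLIP embeddings are compared.
    return perturbed / np.linalg.norm(perturbed, axis=-1, keepdims=True)


# Random unit vectors stand in for real CLIP image embeddings f_img(x).
rng = np.random.default_rng(0)
f_img = rng.standard_normal((4, 512))
f_img /= np.linalg.norm(f_img, axis=-1, keepdims=True)

h_prime = pseudo_text_embedding(f_img, noise_scale=0.25, rng=rng)

# Cosine similarity between each embedding and its perturbed version.
cosine = np.sum(f_img * h_prime, axis=-1)
```

In this sketch, `h_prime` would serve as the training-time condition of the diffusion model, while at inference the CLIP text embedding would take its place. A larger `noise_scale` moves `h_prime` further from the original image embedding.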