Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov

TL;DR

Kandinsky is a latent-diffusion text-to-image model that uses an image prior to map CLIP text embeddings to CLIP image embeddings, combined with a latent UNet diffusion backbone and a Sber-MoVQGAN decoder. The architecture uses frozen text encoders (CLIP-text and XLM-R) and CLIP-image guidance; the image prior is trained on CLIP embeddings to bridge the text and image spaces. It reports an FID of 8.03 on COCO-30K, which the authors describe as the best result among open-source models, and is supported by extensive ablations, a web/Telegram demo, and open-source releases. The work emphasizes efficient training and strong generation quality, and outlines future directions in higher resolutions, improved encoders, editing capabilities, and robust content moderation.

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated substantial quality improvements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate an FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
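
For readers unfamiliar with the metric cited in the abstract: FID is a Fréchet distance between Gaussian fits to feature statistics of real and generated images. The sketch below only illustrates the formula under a loud simplifying assumption, namely diagonal covariances and toy two-dimensional features. The standard metric uses full covariance matrices of Inception features, and this is not the authors' evaluation code. The function names are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Frechet distance between two Gaussians, ASSUMING diagonal covariances.

    Computes ||mu_a - mu_b||^2 + sum(var_a + var_b - 2 * sqrt(var_a * var_b)),
    which is the diagonal special case of the full-covariance FID formula.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum(
        va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy stand-ins for feature vectors of real and generated images.
real_features = [[0.0, 1.0], [2.0, 3.0]]
generated_features = [[1.0, 1.0], [3.0, 3.0]]

real_mean, real_var = mean_and_var(real_features)
gen_mean, gen_var = mean_and_var(generated_features)
toy_fid = frechet_distance_diag(real_mean, real_var, gen_mean, gen_var)
```

Lower values mean the two feature distributions are closer; identical statistics give a distance of zero.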

Paper Structure

This paper contains 10 sections, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Image prior scheme and inference regimes of the Kandinsky model.
  • Figure 2: Examples of inference regimes using the Kandinsky model.
  • Figure 3: Kandinsky web interface for "a corgi gliding on the wave": generation (left) and in/outpainting (right).
  • Figure 4: CLIP-FID curves for different setups.
  • Figure 5: Image generation results with prompt "astronaut riding a horse" for original image prior and linear prior trained on 500 pairs of images with cats.
  • ...and 2 more figures
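
The caption of Figure 5 mentions a linear prior trained on 500 pairs. This excerpt does not give its exact form, so the following is only a minimal sketch, assuming the linear prior is an affine map from text embeddings to image embeddings fitted by ordinary least squares. The embeddings are synthetic stand-ins (no CLIP model is loaded), the dimension of 8 is a toy value, and the function names are hypothetical.

```python
import numpy as np


def fit_linear_prior(text_emb, image_emb):
    """Least-squares fit of W, b such that image_emb ~= text_emb @ W + b."""
    ones = np.ones((text_emb.shape[0], 1))
    design = np.hstack([text_emb, ones])  # append a bias column
    coef, *_ = np.linalg.lstsq(design, image_emb, rcond=None)
    return coef[:-1], coef[-1]  # weight matrix, bias vector


def apply_linear_prior(text_emb, weights, bias):
    """Map text embeddings into the image-embedding space."""
    return text_emb @ weights + bias


rng = np.random.default_rng(0)
n_pairs, dim = 500, 8  # 500 pairs as in the Figure 5 caption; dim is a toy value

# Synthetic data: an assumed ground-truth affine relation between the spaces.
text_emb = rng.normal(size=(n_pairs, dim))
true_weights = rng.normal(size=(dim, dim))
true_bias = rng.normal(size=dim)
image_emb = text_emb @ true_weights + true_bias

weights, bias = fit_linear_prior(text_emb, image_emb)
predicted = apply_linear_prior(text_emb, weights, bias)
max_error = float(np.abs(predicted - image_emb).max())
```

In a real setup, `text_emb` and `image_emb` would come from frozen CLIP encoders, and the predicted image embedding would condition the latent diffusion model in place of the output of the full image prior.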