Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework

Vladimir Arkhipkin, Viacheslav Vasilev, Andrei Filatov, Igor Pavlov, Julia Agafonova, Nikolai Gerasimenko, Anna Averchenkova, Evelina Mironova, Anton Bukashkin, Konstantin Kulikov, Andrey Kuznetsov, Denis Dimitrov

TL;DR

This work presents Kandinsky 3, a novel T2I model based on latent diffusion that achieves a high level of quality and photorealism, and builds on it a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, and I2V and T2V generation.

Abstract

Text-to-image (T2I) diffusion models are a popular foundation for image manipulation methods such as editing, image fusion, and inpainting. Image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion that achieves a high level of quality and photorealism. The key feature of the new architecture is that it can be adapted simply and efficiently to many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, and I2V and T2V generation. We also present a distilled version of the T2I model that runs inference in 4 steps of the reverse process, 3 times faster than the base model, without reducing image quality. We deployed a user-friendly demo system in which all the features can be tested publicly. Additionally, we released the source code and checkpoints for Kandinsky 3 and the extended models. Human evaluations show that Kandinsky 3 achieves one of the highest quality scores among open-source generation systems.

Paper Structure

This paper contains 30 sections, 15 figures, 1 table.

Figures (15)

  • Figure 2: Architecture of the text-to-image model Kandinsky 3. It consists of a text encoder, a latent conditioned diffusion U-Net, and an image decoder.
  • Figure 3: Inference regimes of the Kandinsky 3 model.
  • Figure 4: Image-to-Video generation. The input image undergoes a right shift transformation. The result enters the image-to-image process to eliminate transformation artifacts and update the semantic content guided by the text prompt.
  • Figure 5: Human evaluation results on DrawBench (Saharia et al., 2022).
  • Figure 6: Kandinsky 3 U-Net architecture. The architecture is based on modified BigGAN-deep blocks (left: downsample block; right: upsample block), whose bottlenecks allow us to increase the depth of the architecture. The attention layers are placed at levels with a lower resolution than the original image.
  • ...and 10 more figures
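
The Figure 4 caption describes a two-stage step: shift the input image to the right, then pass the result through an image-to-image process. The following is a minimal NumPy sketch of only the preparation for that step, not the authors' implementation. The function names `shift_right` and `partially_noise`, the edge-replication fill, and the `strength` parameter are all illustrative assumptions; the text-guided denoising performed by the diffusion model is deliberately left out.

```python
import numpy as np


def shift_right(image: np.ndarray, pixels: int) -> np.ndarray:
    """Shift an H x W x C image to the right by `pixels` columns.

    Assumption: the vacated columns on the left are filled by repeating
    the first column. The caption does not say how the gap is filled.
    """
    if pixels <= 0:
        return image.copy()
    width = image.shape[1]
    pixels = min(pixels, width)
    shifted = np.empty_like(image)
    # Move the surviving content to the right.
    shifted[:, pixels:] = image[:, : width - pixels]
    # Fill the gap on the left with the first column (broadcast over width).
    shifted[:, :pixels] = image[:, :1]
    return shifted


def partially_noise(image: np.ndarray, strength: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Blend the image with Gaussian noise.

    Assumption: this variance-preserving blend stands in for the forward
    noising that an image-to-image diffusion pass would start from.
    `strength` in [0, 1] is a hypothetical knob: 0 keeps the image as is,
    1 replaces it with pure noise.
    """
    noise = rng.standard_normal(image.shape)
    return np.sqrt(1.0 - strength) * image + np.sqrt(strength) * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((64, 64, 3))          # placeholder input image
    shifted = shift_right(frame, pixels=8)   # the "right shift transformation"
    start = partially_noise(shifted, strength=0.5, rng=rng)
    print(shifted.shape, start.shape)
```

In a full pipeline, the noised result would then be denoised by the diffusion model under the text prompt, which is the stage the caption credits with removing transformation artifacts and updating the semantic content.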