Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

TL;DR

Kandinsky 5.0 presents a versatile family of open-source foundation models for high-resolution image and 10-second video generation, spanning Image Lite, Video Lite, and Video Pro with 6B, 2B, and 19B parameters, respectively. It introduces CrossDiT, a diffusion transformer with NABLA sparse attention, and a data-centric pipeline that includes large-scale pretraining, self-supervised fine-tuning, distillation, and RL-based post-training to boost realism, prompt alignment, and temporal consistency. The work provides extensive data processing pipelines, multi-stage training, and multiple optimization strategies that yield state-of-the-art human-evaluated quality across several tasks, while offering open-source access to code and checkpoints. It also discusses limitations and ethical considerations, and frames Kandinsky 5.0 as a foundation for democratizing high-quality multimodal generation for research and practical deployment.

Abstract

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

Paper Structure

This paper contains 87 sections, 5 equations, 47 figures, 5 tables.

Figures (47)

  • Figure 1: Kandinsky 5.0 Models Family
  • Figure 2: The evolution of Kandinsky models.
  • Figure 3: Data processing pipeline for the Kandinsky T2V (text-to-video) and Kandinsky T2I (text-to-image) datasets. The workflow begins with raw image and video data. After initial filtering and deduplication, the data is processed through advanced filtering (including watermark detection, quality assessment, and complexity and text filtering), classification, and content annotation stages. The final processed data is stored grouped by resolution (minimal side lengths of 256, 512, and 1024) for use in the corresponding pre-training stage.
  • Figure 4: Distribution of key data categories across the curated Kandinsky T2I dataset by Location, Main object and Picture style.
  • Figure 5: Instructive dataset processing pipeline. Initial image data (left) is processed through a set of similarity and exclusion criteria — including CLIP and DINO embeddings, face similarity, duplicate removal, and custom overlap heuristics — to produce filtered image-instruction pairs (right). Each retained image is paired with an Instruct token, forming high-quality instruction-tuned training samples.
  • ...and 42 more figures