YaART: Yet Another ART Rendering Technology

Sergey Kastryulin, Artem Konev, Alexander Shishenya, Eugene Lyapustin, Artem Khurshudov, Alexander Tselousov, Nikita Vinokurov, Denis Kuznedelev, Alexander Markovich, Grigoriy Livshits, Alexey Kirillov, Anastasiia Tabisheva, Liubov Chubarova, Marina Kaminskaia, Alexander Ustyuzhanin, Artemii Shvetsov, Daniil Shlenskii, Valerii Startsev, Dmitrii Kornilov, Mikhail Romanov, Artem Babenko, Sergei Ovcharenko, Valentin Khrulkov

TL;DR

YaART is a production-grade cascaded diffusion model for text-to-image generation, aligned with RLHF. The work systematically analyzes how model size, dataset quality, and training compute interact to shape final image quality. The approach combines a three-stage cascade (GEN64, SR256, SR1024), large-scale pre-training on curated high-quality data, supervised fine-tuning, and RL alignment with multiple reward signals to optimize realism, text-image alignment, and aesthetics. Key findings: smaller, high-quality datasets can rival larger ones; increasing model size generally improves quality given enough compute; and RLHF improves perceptual quality beyond what dataset improvements alone provide. The work offers practical guidance for production-scale diffusion training and positions YaART as competitive with leading existing models, with implications for data curation and RL-based refinement in real-world deployments.
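As a rough illustration of how the three-stage cascade hands an image from one stage to the next, here is a minimal sketch in Python. Only the stage names and output resolutions (GEN64, SR256, SR1024) come from the text. The functions `placeholder_generate` and `placeholder_upscale` are hypothetical stand-ins that use random noise and nearest-neighbour upsampling where the real system uses learned diffusion models, so this shows the data flow only, not the method itself.

```python
import random

# Stage names and output resolutions are taken from the TL;DR;
# every function below is a hypothetical placeholder, not YaART code.
CASCADE = [("GEN64", 64), ("SR256", 256), ("SR1024", 1024)]


def placeholder_generate(prompt, size, rng):
    """Stand-in for the text-conditioned base stage: a size x size grid of noise."""
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def placeholder_upscale(image, target):
    """Stand-in for a super-resolution stage: nearest-neighbour upsampling."""
    factor = target // len(image)
    out = []
    for row in image:
        # Repeat each pixel horizontally, then repeat the widened row vertically.
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def run_cascade(prompt, seed=0):
    """Run the placeholder stages in order and record each stage's output size."""
    rng = random.Random(seed)
    (base_name, base_size), *sr_stages = CASCADE
    image = placeholder_generate(prompt, base_size, rng)
    trace = [(base_name, (len(image), len(image[0])))]
    for name, size in sr_stages:
        image = placeholder_upscale(image, size)
        trace.append((name, (len(image), len(image[0]))))
    return trace


if __name__ == "__main__":
    for name, shape in run_cascade("a red fox in the snow"):
        print(name, shape)
```

Each super-resolution stage here upsamples by a factor of 4, which follows directly from the stated resolutions (64 to 256 to 1024). In the real cascade, each stage would presumably also be conditioned on the prompt or on the previous stage's output in ways this sketch does not model.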

Abstract

In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we focus in particular on the choice of model and training dataset sizes, aspects that have not previously been systematically investigated for text-to-image cascaded diffusion models. Specifically, we comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, both of which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient regime for training diffusion models. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models.

Paper Structure (30 sections, 12 figures, 4 tables)

Figures (12)

  • Figure 1: RL-aligned YaART generates visually pleasing and highly consistent images.
  • Figure 2: An evolution of rewards (left) leads to an increase of human preference rate (right) throughout the RL alignment stage.
  • Figure 3: The content of YaBasket. The three major prompt categories (left) include Products, almost equally split into eight sub-categories (right).
  • Figure 4: Scaling up the convolutional GEN64 model improves training, leading to higher quality models. Larger models train faster in terms of both training steps and GPU hours, leading to better results across different dataset sizes (top and middle rows). Dataset size only weakly influences the model's final quality (bottom row).
  • Figure 5: Dynamics of side-by-side comparisons of the half-size model with Stable Diffusion v1.4 (Rombach et al., 2022) (left) and with the fully sized pre-trained model (right). Each point shows the mean and standard deviation across three independent human evaluation experiments. Note the rapid quality growth through the first few hundred iterations, after which performance reaches a plateau. Given enough compute, the half-size test model is capable of surpassing Stable Diffusion in quality, while the fully sized YaART model remains unsurpassed even with more compute.
  • ...and 7 more figures