Alchemist: Turning Public Text-to-Image Data into Generative Gold

Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin

TL;DR

This work tackles the challenge of creating high-quality, general-purpose supervised fine-tuning (SFT) data for text-to-image (T2I) models. It introduces a diffusion-guided pipeline that uses a pre-trained diffusion model as a quality estimator, culminating in the Alchemist dataset of 3,350 high-impact image-text pairs and open-source fine-tuned weights for five Stable Diffusion (SD) models. Empirical results show consistent aesthetic and complexity improvements across multiple public models, with modest fidelity trade-offs and no meaningful loss in image-text relevance. The approach demonstrates that compact, carefully filtered datasets can rival larger publicly available SFT resources, enabling reproducible advances in open T2I research.

Abstract

Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness depends heavily on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.

Paper Structure

This paper contains 48 sections, 17 figures, 7 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Images generated by Stable Diffusion 3.5 Large fine-tuned on Alchemist, demonstrating enhanced aesthetic quality and complexity while maintaining prompt adherence.
  • Figure 2: Overview of the multi-stage image filtering pipeline. Beginning with a web-scale collection of raw data, the pipeline sequentially filters images to isolate a high-quality subset optimally suited for supervised fine-tuning of text-to-image models.
  • Figure 3: Examples of images generated by five Stable Diffusion models for the prompt "Mars rises on the horizon." before and after tuning on Alchemist.
  • Figure 4: Comparison of models fine-tuned on Alchemist variants of different sizes. The table reports human win rates (by aspect) of Alchemist-3k-tuned models against models tuned on 7k and 19k variants of Alchemist. Green indicates statistically significant improvement ($p < 0.05$), gray no significant change, and red a statistically significant decline.
  • Figure 5: Results of side-by-side (SbS) comparisons of SDXL and SD3.5 Medium, before and after fine-tuning, versus FLUX. The grey shaded region shows the interval of statistical insignificance.
  • ...and 12 more figures
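
The green/gray/red coding described in the Figure 4 caption can be illustrated with a small significance check on a human win rate. The sketch below is not the paper's procedure: it assumes a two-sided exact binomial test against a 50% win rate, with ties discarded beforehand, and the function names `binomial_two_sided_p` and `classify_win_rate` are hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `wins` out of `trials`.

    Assumption: the null hypothesis is a win probability of `p0`
    (0.5 means neither model is preferred).
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    # Probability of every possible win count under the null hypothesis.
    pmf = [
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Sum the outcomes that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


def classify_win_rate(wins: int, trials: int, alpha: float = 0.05) -> str:
    """Map a win count to the color coding used in the caption."""
    p_value = binomial_two_sided_p(wins, trials)
    if p_value >= alpha:
        return "gray"  # no statistically significant change
    return "green" if wins / trials > 0.5 else "red"


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    for wins in (50, 65, 35):
        print(wins, classify_win_rate(wins, 100))
```

With 100 hypothetical comparisons, a 50% win rate falls in the gray band, while 65% and 35% are flagged as a significant improvement and a significant decline respectively. The width of the gray band shrinks as the number of comparisons grows.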