Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Xu Ma, Yitian Zhang, Qihua Dong, Yun Fu

TL;DR

Fine-T2I tackles the data bottleneck in open text-to-image fine-tuning by introducing a large open dataset that blends synthetic and real images, carefully filtered for text–image alignment and aesthetic quality. The authors build a multi-stage pipeline comprising prompt generation, semantic deduplication, safety and attribute filtering, prompt enhancement, and rigorous image–text filtering, yielding roughly 6 million examples totaling about 2 TB. Experiments across diffusion and autoregressive backbones show that models fine-tuned on Fine-T2I consistently improve in visual quality and instruction adherence, as validated by human judgments and GenEval. The work provides an open, scalable foundation for improving open T2I models and closing the gap with production-grade systems.

Abstract

High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.

Paper Structure (32 sections, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Visual comparison across datasets. For each dataset, we randomly sample three text–image pairs (no cherry-picking) to illustrate overall dataset quality. Zoom in for details. Dataset names for each row are provided on the following page.
  • Figure 2: Example samples from our Fine-T2I dataset, covering diverse resolutions, aspect ratios, styles, categories, and tasks. Please see the supplementary for examples with detailed attributes and prompts. Images above the dashed line are our synthetic samples, and those below it are our curated real images. Please also refer to https://huggingface.co/spaces/ma-xu/fine-t2i-explore to explore more.
  • Figure 3: Distribution of semantic cosine similarities in a random prompt subset. We set the deduplication threshold to 0.8.
  • Figure 4: Analysis of our synthetic set. We provide the distributions of sample categories, sample styles, and related tasks. Note that for tasks, we only count prompts that were given a specific task requirement when they were generated; about 61.5% of prompts did not ask for a specific task. Please see the detailed distribution analysis section in the supplementary for details.
  • Figure 5: The aesthetic score distribution of our Fine-T2I. Both the synthetic sets and the curated set demonstrate high aesthetic scores, implying strong visual quality for fine-tuning.
  • ...and 8 more figures
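
The Figure 3 caption mentions a semantic deduplication threshold of 0.8 on cosine similarity. As a rough illustration of what thresholded deduplication can look like, here is a minimal sketch. It is not the authors' implementation: the greedy keep-first strategy, the function names `cosine_similarity` and `greedy_dedup`, the inclusive comparison at the threshold, and the toy 2-D vectors are all assumptions made for this example. The source does not say which embedding model or search procedure was used.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_dedup(embeddings: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Return the indices kept after greedy near-duplicate removal.

    Hypothetical helper: an item is kept only if its similarity to every
    previously kept item is strictly below `threshold`. The loop is O(n^2),
    which is fine for a toy example but not for millions of prompts.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "prompt embeddings" (made-up values, not from the paper).
toy = [
    [1.0, 0.0],  # first item, always kept
    [0.9, 0.1],  # nearly parallel to item 0, so treated as a duplicate
    [0.0, 1.0],  # orthogonal to item 0, kept
    [0.7, 0.7],  # similarity of about 0.707 to both kept items, kept
]
kept_indices = greedy_dedup(toy, threshold=0.8)
print(kept_indices)
```

At dataset scale, a pairwise loop like this would normally be replaced by an approximate nearest-neighbor index; the sketch only shows the thresholding logic.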