When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue, Andreas Dengel

TL;DR

We revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression, highlighting an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Abstract

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Paper Structure (31 sections, 7 figures)

Figures (7)

  • Figure 1: We train ResNet-50 classifiers on images generated by various T2I models for a subset of ImageNet-1k classes and evaluate their accuracy on real test data (Synth $\rightarrow$ Real). Our results reveal a downward trend over time: newer models are progressively less reliable as training data generators.
  • Figure 2: To probe which aspects of synthetic images are most affected, we transform images to suppress or amplify distortions in a given domain. To separate low-level from high-level detail, we measure the performance gap when training in depth space, which removes textures, and when training a low-receptive-field classifier (visualized in the figure), which operates on $9\times9$ image patches and hence does not rely on structure. To separate high-frequency from low-frequency distortions, we train on low-pass and high-pass filtered images. Removing offending features should narrow the gap relative to RGB, while removing non-offending features should widen it.
  • Figure 3: Accuracy on the real ImageNet-1k test set versus GenEval score (top) and CLIPScore (bottom). Each point represents the performance of a classifier trained on data synthesized by a specific T2I model; the horizontal line indicates the baseline trained on real data. Across architectures, we observe a downward trend; higher benchmark scores correspond to lower transfer performance for class label prompts.
  • Figure 4: Performance comparison for (left) structure (depth-based classifier) and texture (local feature classifier), and (right) frequency-filtered data for class name- and caption-guided synthetic datasets. Image structure is consistently less affected than texture, while high-frequency components degrade more strongly than low frequencies (especially in better-performing models).
  • Figure 5: Dataset diversity using density and coverage metrics from Naeem et al. (2020), plotted against classifier accuracy on real data (color). Models with high density but low coverage produce visually consistent yet distributionally narrow samples, while those with higher coverage span a broader portion of real data space and correlate with better generalization. Thus, recent T2I models achieve higher sample quality through compact, high-density clusters but sacrifice diversity essential for training quality.
  • ...and 2 more figures
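
The density and coverage metrics referenced in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It follows the commonly cited definitions: density counts how many real-sample k-nearest-neighbour balls contain each synthetic sample, and coverage is the fraction of real samples whose ball contains at least one synthetic sample. This is not the authors' code. The function names, the choice of `k=5`, and the use of plain Euclidean distance on pre-extracted feature vectors are assumptions made for illustration; the page does not state which feature extractor or neighbourhood size the paper uses.

```python
import numpy as np


def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of a (N, D) and rows of b (M, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def kth_nn_radius(real: np.ndarray, k: int) -> np.ndarray:
    """Distance from each real sample to its k-th nearest real neighbour."""
    d = pairwise_distances(real, real)
    # After sorting each row, index 0 is the self-distance (0),
    # so index k is the k-th nearest *other* sample.
    return np.sort(d, axis=1)[:, k]


def density_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5):
    """Return (density, coverage) of fake features w.r.t. real features.

    `real` and `fake` are hypothetical (N, D) and (M, D) feature arrays,
    e.g. embeddings from some pretrained network (an assumption here).
    """
    radii = kth_nn_radius(real, k)           # shape (N,)
    d_rf = pairwise_distances(real, fake)    # shape (N, M)
    inside = d_rf < radii[:, None]           # fake j lies in ball of real i
    # Density: average number of real balls containing a fake sample, over k.
    density = inside.sum(axis=0).mean() / k
    # Coverage: fraction of real balls containing at least one fake sample.
    coverage = (d_rf.min(axis=1) < radii).mean()
    return float(density), float(coverage)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_feats = rng.normal(size=(200, 8))
    # A "collapsed" generator: every sample sits on top of one real point.
    collapsed_feats = np.tile(real_feats[0], (200, 1))
    print("identical:", density_coverage(real_feats, real_feats.copy()))
    print("collapsed:", density_coverage(real_feats, collapsed_feats))
```

In this toy setup, the collapsed set covers only the few real samples whose neighbourhood contains that single point, which mirrors the caption's description of compact clusters that sacrifice diversity. For large datasets the dense `(N, M)` distance matrix would need to be replaced by a batched or tree-based neighbour search.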