Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?

Zebin You; Xinyu Zhang; Hanzhong Guo; Jingdong Wang; Chongxuan Li

TL;DR

This work introduces distribution classification as a diagnostic tool to quantify how far diffusion-generated images are from real data, revealing that neural-network classifiers can reliably distinguish real from generated distributions even when FID and human judgments suggest high realism. It shows contradictions between classifier-based distribution distance, FID, and human perception, and demonstrates practical implications including when to augment versus replace real data and how classifier guidance can improve generated image realism. The study provides granular insights into spatial and frequency features learned by diffusion models and highlights MAD (Model Autophagy Disorder) as a consequence of distribution drift in continual synthetic-data loops. Overall, distribution classification offers a complementary perspective to traditional metrics, enabling deeper understanding and practical improvements in generative modeling and data usage.

Abstract

The ultimate goal of generative models is to perfectly capture the data distribution. For image generation, common metrics of visual quality (e.g., FID) and the perceived truthfulness of generated images seem to suggest that we are nearing this goal. However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. Moreover, we uncover an intriguing discrepancy: classifiers can easily differentiate between diffusion models with comparable performance (e.g., U-ViT-H vs. DiT-XL), but struggle to distinguish between models within the same family but of different scales (e.g., EDM2-XS vs. EDM2-XXL). Our methodology carries several important implications. First, it naturally serves as a diagnostic tool for diffusion models by analyzing specific features of generated data. Second, it sheds light on the model autophagy disorder and offers insights into the use of generated data: augmenting real data with generated data is more effective than replacing it. Third, classifier guidance can significantly enhance the realism of generated images.
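The distribution classification task described above can be illustrated with a toy sketch. This is not the authors' setup: it assumes low-dimensional feature vectors in place of images and a plain logistic-regression classifier in place of a deep network, and the function name `distribution_classification_accuracy` and all hyperparameters are hypothetical. Held-out accuracy near 0.5 suggests the two sample sets are indistinguishable to the classifier; accuracy well above 0.5 suggests a measurable gap between the distributions.

```python
import numpy as np


def distribution_classification_accuracy(samples_a, samples_b, steps=500, lr=0.1, seed=0):
    """Train a logistic-regression classifier to tell two sample sets apart.

    Returns held-out accuracy. Toy stand-in for the paper's neural classifiers.
    """
    rng = np.random.default_rng(seed)
    x = np.concatenate([samples_a, samples_b]).astype(np.float64)
    y = np.concatenate([np.zeros(len(samples_a)), np.ones(len(samples_b))])

    # Shuffle, then split half for training and half for evaluation.
    order = rng.permutation(len(x))
    x, y = x[order], y[order]
    split = len(x) // 2
    x_train, y_train = x[:split], y[:split]
    x_test, y_test = x[split:], y[split:]

    # Standardize features using training statistics only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8
    x_train = (x_train - mean) / std
    x_test = (x_test - mean) / std

    # Full-batch gradient descent on the logistic loss.
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_train @ w + b)))
        grad = p - y_train
        w -= lr * (x_train.T @ grad) / len(x_train)
        b -= lr * grad.mean()

    predictions = (x_test @ w + b) > 0
    return float((predictions == y_test).mean())


# Hypothetical data: "real" features, a second draw from the same
# distribution, and a "generated" set with a small mean shift.
rng = np.random.default_rng(42)
real = rng.normal(0.0, 1.0, size=(2000, 8))
real_again = rng.normal(0.0, 1.0, size=(2000, 8))
generated = rng.normal(1.0, 1.0, size=(2000, 8))

acc_same = distribution_classification_accuracy(real, real_again)
acc_shifted = distribution_classification_accuracy(real, generated)
print(f"same distribution: {acc_same:.3f}, shifted distribution: {acc_shifted:.3f}")
```

In this sketch the same-distribution pair plays the role of the sanity check listed in the outline ("Classifiers cannot distinguish samples from the same distribution"), while the shifted pair mimics a real-versus-generated comparison.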

Paper Structure (28 sections, 7 equations, 12 figures, 16 tables)

Introduction
Related work
Experimental settings
Generated distributions are easily classified as generated
Classifiers cannot distinguish samples from the same distribution
Classifiers achieve high accuracy across various settings
Self-supervised classifiers can also identify generated images
Classifier contradictions with FID and human judgments
Implications of distribution classification task
Diagnosing problem within diffusion models
Evaluating the use of generated data
Enhancing Realism in Generated Images
Conclusion
Background
Diffusion model
...and 13 more sections

Figures (12)

Figure 1: Four-way distribution classification tasks: Which one is real in each row? The samples are from real images or generated from state-of-the-art diffusion models. Notably, classifiers consistently and effortlessly distinguish between real and generated images in all settings.
Figure 2: Binary distribution classification on label-to-image. A positive correlation is observed between accuracy and the number of training samples. With only 50k samples per distribution, classifiers achieve over 99.5% accuracy.
Figure 3: User study. Classifiers can easily distinguish between diffusion models with similar FIDs (U-H/2, D) but struggle with models from the same family that differ in parameters (E2-XS, E2-XXL). In contrast, humans show the opposite trend.
Figure 4: Visualization of frequency domain processing. Top: Low-pass filters, Middle: High-pass filters, each processed with increasing thresholds: 10, 30, 50, 100. Bottom: Band-pass filters, processed with band thresholds: 0-10, 10-30, 30-50, 50-100.
Figure 5: The classifier achieves high accuracy across various combinations of limited frequency components, except when only minimal low-frequency components are present (subfigure (a), low-pass filter threshold = 10; subfigure (c), band-pass filter threshold 0-10).
...and 7 more figures
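
The low-pass, high-pass, and band-pass processing named in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a 2D grayscale image, an ideal (hard-edged) radial mask, and thresholds interpreted as radii in frequency bins measured from the center of the shifted spectrum. The helper names `radial_frequency_mask` and `filter_image` are hypothetical.

```python
import numpy as np


def radial_frequency_mask(height, width, low, high):
    """Boolean mask keeping centered-FFT frequencies with low <= radius < high."""
    yy, xx = np.ogrid[:height, :width]
    radius = np.hypot(yy - height // 2, xx - width // 2)
    return (radius >= low) & (radius < high)


def filter_image(image, low=0.0, high=np.inf):
    """Keep only the frequency band [low, high) of a 2D grayscale image.

    low=0 gives a low-pass filter, high=inf gives a high-pass filter,
    and finite values for both give a band-pass filter.
    """
    # Move the zero-frequency component to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    mask = radial_frequency_mask(image.shape[0], image.shape[1], low, high)
    # Undo the shift before inverting; the result is real up to rounding.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real


# Hypothetical input: random noise standing in for an image.
rng = np.random.default_rng(0)
image = rng.random((128, 128))

low_passed = filter_image(image, high=30)            # low-pass, threshold 30
high_passed = filter_image(image, low=30)            # high-pass, threshold 30
band_passed = filter_image(image, low=10, high=30)   # band-pass, band 10-30
```

Because the masks for a given threshold partition the spectrum, the low-pass and high-pass outputs at the same threshold sum back to the original image, which is a convenient check when experimenting with the thresholds from the caption.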
