REAL: Realism Evaluation of Text-to-Image Generation Models for Effective Data Augmentation

Ran Li, Xiaomeng Jin, Heng Ji

TL;DR

REAL introduces a three-dimensional realism evaluation framework for text-to-image outputs, assessing fine-grained visual attributes, unusual visual relationships, and visual styles via VQA prompts and a fine-tuned CLIP classifier. The method demonstrates strong alignment with human judgments (up to a Spearman's ρ of $0.62$) and improves downstream tasks such as image classification, captioning, and visual relationship detection when used to rank and filter augmented data. By benchmarking four major T2I models across realism dimensions, REAL reveals strengths and weaknesses in current outputs and provides a practical realism benchmark for future model improvements. Overall, REAL enables realism-aware data augmentation and model evaluation, with significant gains in downstream performance and clearer guidance for advancing T2I realism.

Abstract

Recent advancements in text-to-image (T2I) generation models have transformed the field. However, challenges persist in generating images that reflect demanding textual descriptions, especially for fine-grained details and unusual relationships. Existing evaluation metrics focus on text-image alignment but overlook the realism of the generated image, which can be crucial for downstream applications like data augmentation in machine learning. To address this gap, we propose REAL, an automatic evaluation framework that assesses the realism of T2I outputs along three dimensions: fine-grained visual attributes, unusual visual relationships, and visual styles. REAL achieves a Spearman's rho score of up to 0.62 in alignment with human judgement and demonstrates utility in ranking and filtering augmented data for tasks like image captioning, classification, and visual relationship detection. Empirical results show that images scored highly by our metrics improve image classification F1 scores by up to 11.3%, while low-scoring ones degrade them by up to 4.95%. We benchmark four major T2I models across the realism dimensions, providing insights for future improvements in T2I output realism.
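The two operations the abstract relies on, measuring rank agreement with human ratings and filtering augmented images by score, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`average_ranks`, `spearman_rho`, `keep_top_fraction`), the example scores, and the choice of a fixed top fraction as the filtering rule are all assumptions made for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def keep_top_fraction(items: Sequence[str], scores: Sequence[float],
                      fraction: float) -> list[str]:
    """Keep the highest-scoring share of items (hypothetical filtering rule)."""
    ranked = sorted(zip(scores, items), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return [item for _, item in ranked[:keep]]


# Illustrative values only; these are not results from the paper.
metric_scores = [0.1, 0.9, 0.5, 0.7]
human_ratings = [1.0, 5.0, 2.0, 4.0]
rho = spearman_rho(metric_scores, human_ratings)
augmented = keep_top_fraction(["a.png", "b.png", "c.png", "d.png"],
                              metric_scores, fraction=0.5)
```

A rank-based coefficient suits this comparison because an automatic score and a human rating scale need not be linearly related; only the ordering of the images has to agree.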

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of the three dimensions of realism we study. On the left, an image generated by Stable Diffusion v3.5 exhibits incorrect visual attributes for the species, resembling a cat despite having the distinctive tail. In the middle, Kandinsky 3 struggles with the unusual visual relationship, generating a person who overlaps with the bed instead of carrying it. On the right, DALL-E 3 produces a stylized illustration instead of a photorealistic image as prompted. REAL score correctly identifies all real images, whereas CLIPScore does not.
  • Figure 2: Overview of the three components of the REAL framework. For visual attributes and relationships, REAL performs schematic evaluation of the presence and realism of each component. For visual styles, REAL leverages a fine-tuned CLIP model for rating.
  • Figure 3: Examples of high- and low-quality images.
  • Figure 4: Image classification on the iNaturalist dataset (F1 score).
  • Figure 5: Image captioning on the iNaturalist dataset (BLEU score).
  • ...and 2 more figures