Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu

TL;DR

This work addresses the need for fine-grained realism and plausibility evaluation of AI-generated images, moving beyond single-score quality assessments. It introduces Q-Real, a dataset of 3,088 images with per-object bounding boxes and dual-dimension QA (realism and plausibility), totaling 429,096 annotations across 17,879 entities, plus a 500-portrait subset with enhanced explanations. Building on Q-Real, the authors establish Q-Real Bench with ObjectQA and ImageQA tasks, and propose a Task-Specific Finetuning framework to improve multi-modal large language models (MLLMs) on grounding and reasoning about image quality. Experiments across multiple MLLMs show that finetuning markedly enhances both judgment and grounding capabilities, demonstrating the dataset’s usefulness for guiding generative-model improvements and for reliable evaluation of AI-generated content. The work lays groundwork for more actionable feedback signals to optimize text-to-image systems and provides a practical protocol for fine-grained evaluation in real-world applications.

Abstract

Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these entities along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Abilities of Q-Real-tuned InternVL2.5-8B on two kinds of fine-grained quality assessment in AI-generated images, in comparison with the baseline version and GPT-4o.
  • Figure 2: Application scenarios of ObjectQA and ImageQA in Q-Real Bench. ObjectQA can be used for large-scale image screening, while ImageQA provides fine-grained descriptions that can facilitate the optimization of generative models.
  • Figure 3: The pipeline of our dataset construction.
  • Figure 4: Detailed annotation protocol for Human Plausibility. Comparison between our fine-grained annotation and the conventional annotation approach for human regions.
  • Figure 5: Overview of Q-Real Bench, illustrating its two tasks: ObjectQA and ImageQA, including evaluation procedures and metrics.
  • ...and 1 more figure