
DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

Haokun Zhou, Yipeng Hong

TL;DR

This paper introduces DiffuSyn Bench, a diffusion-based, automated framework for generating synthetic text–image benchmarks that embed controlled real-world inconsistencies. It investigates how well Large Vision-Language Models (LVLMs) distinguish AI-generated from human-generated imagery, revealing a substantial human–LVLM perception gap and a bias toward predicting human origin. Using a three-stage pipeline and LLM-as-a-Judge (LAJ) scoring, the authors quantify model weaknesses across temporal, biological, and logical error types, and show that diffusion-generated benchmarks can be scaled and externally validated while preserving model rankings. The findings underscore the need for LVLMs with deeper physical and causal reasoning and offer a scalable method to continuously stress-test vision–language capabilities on realistic, complex scenarios.

Abstract

This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI-generated and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias and performed significantly worse than humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
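
The four steps named in the abstract (topic retrieval, narrative script generation, error embedding, image generation) can be sketched as a simple chain of stages. This is only an illustrative outline, not the authors' implementation: `BenchmarkItem`, `build_benchmark`, the stage callables, and the round-robin assignment of error types are all hypothetical names and choices, and the stub stages in `demo_benchmark` stand in for whatever retrieval system, language model, and diffusion model a real pipeline would call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    """One text-image pair with an intentional error (hypothetical schema)."""
    topic: str
    script: str         # narrative script before the error is embedded
    error_type: str     # e.g. "temporal", "biological", or "logical"
    flawed_script: str  # script after the intentional error is embedded
    image_ref: str      # handle to the generated image (path, URL, or ID)


def build_benchmark(
    retrieve_topics: Callable[[int], List[str]],
    write_script: Callable[[str], str],
    embed_error: Callable[[str, str], str],
    generate_image: Callable[[str], str],
    error_types: List[str],
    n_topics: int,
) -> List[BenchmarkItem]:
    """Run the four stages in order for each retrieved topic."""
    if not error_types:
        raise ValueError("error_types must not be empty")
    items = []
    for i, topic in enumerate(retrieve_topics(n_topics)):
        # Assumption: error types are assigned round-robin; the source
        # does not say how they are distributed across items.
        error_type = error_types[i % len(error_types)]
        script = write_script(topic)
        flawed = embed_error(script, error_type)
        image_ref = generate_image(flawed)
        items.append(BenchmarkItem(topic, script, error_type, flawed, image_ref))
    return items


ERROR_TYPES = ["temporal", "biological", "logical"]


def demo_benchmark() -> List[BenchmarkItem]:
    """Build a tiny benchmark using placeholder stages (no models involved)."""
    return build_benchmark(
        retrieve_topics=lambda n: [f"topic-{k}" for k in range(n)],
        write_script=lambda topic: f"A scene about {topic}.",
        embed_error=lambda script, kind: f"{script} [{kind} error]",
        generate_image=lambda prompt: f"image-for:{prompt}",
        error_types=ERROR_TYPES,
        n_topics=4,
    )
```

Passing the stages in as callables keeps the outline independent of any particular model or API, so each stage could be swapped or tested in isolation.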

Paper Structure

This paper contains 30 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: A brief framework for benchmark generation and a specific example from our benchmark, showing a GPT-4V response. The example image depicts a modern office with a medieval knight using a laptop, highlighting intentional anachronism.
  • Figure 2: Automated framework to generate diffusion-based synthetic evaluations.
  • Figure 3: Performance comparison between LVLMs and humans in distinguishing AI-generated vs. human-generated images (six heatmaps, one per system).
  • Figure 4: Overall correctness of humans and models on AI- vs. non-AI-generated images. The red dashed line (50%) marks random guessing.
  • Figure 5: Performance of four models across three error types. GPT-4V attains the highest totals across temporal, biological, and logical categories.
  • ...and 2 more figures
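
A comparison like the one in Figure 4 rests on correctness computed separately for AI-generated and non-AI-generated images, read against the 50% random-guessing line. The sketch below shows one way such numbers could be computed; it is an assumption about the bookkeeping, not the paper's code. The function name, the `"ai"`/`"human"` label strings, and the toy records are hypothetical and are not results from the paper.

```python
from typing import Dict, List, Tuple

RANDOM_GUESS = 0.5  # the 50% baseline for a two-way choice


def correctness_by_source(records: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute per-source correctness from (true_label, predicted_label) pairs.

    Each label is either "ai" or "human". Also returns the share of all
    predictions that were "human", which exposes a bias toward one answer.
    """
    totals = {"ai": 0, "human": 0}
    correct = {"ai": 0, "human": 0}
    predicted_human = 0
    for truth, pred in records:
        totals[truth] += 1
        correct[truth] += int(truth == pred)
        predicted_human += int(pred == "human")
    if totals["ai"] == 0 or totals["human"] == 0:
        raise ValueError("need at least one image from each source")
    return {
        "ai_correct": correct["ai"] / totals["ai"],
        "human_correct": correct["human"] / totals["human"],
        "predicted_human_rate": predicted_human / len(records),
    }


# Toy records invented for illustration only (not data from the paper):
# a judge that answers "human" for almost every image.
toy_records = [
    ("ai", "ai"), ("ai", "human"), ("ai", "human"), ("ai", "human"),
    ("human", "human"), ("human", "human"), ("human", "human"), ("human", "human"),
]
toy_scores = correctness_by_source(toy_records)
```

In this toy case the judge scores below `RANDOM_GUESS` on AI-generated images while looking perfect on human-made ones, which is why a single pooled accuracy can hide a bias toward predicting human origin.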