Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

Jie Zhang, Zhongqi Wang, Mengqi Lei, Zheng Yuan, Bei Yan, Shiguang Shan, Xilin Chen

TL;DR

Dysca addresses data leakage and the lack of stylized/noisy image evaluation in LVLM perception benchmarks by generating a large, synthetic, end-to-end evaluation dataset. It introduces a four-part metadata-driven pipeline that uses diffusion models to synthesize images across 51 styles, constructs 20 perceptual subtasks, and evaluates 24 open-source plus 2 closed-source LVLMs under 4 scenarios and 3 question types, with automated data cleaning. The study demonstrates that synthetic Dysca data yield meaningful model rankings and reveal consistent weaknesses under print and adversarial attacks, while showing alignment with real-world performance through correlation analyses and distributional comparisons. Overall, Dysca offers a scalable, configurable, and transparent framework for fine-grained LVLM perception evaluation, and its data-synthesis pipeline can potentially be reused for training or further benchmarking.

Abstract

Currently, many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct questions by selecting images from existing datasets, resulting in potential data leakage. Besides, these benchmarks merely focus on evaluating LVLMs on realistic-style images and clean scenarios, leaving multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesized images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choice, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark to which new subtasks and scenarios can easily be added. A total of 24 advanced open-source LVLMs and 2 closed-source LVLMs are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released at https://github.com/Robin-WZQ/Dysca.
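To make the rule-based, metadata-driven paradigm described above concrete, here is a minimal sketch of how metadata could be sampled into a prompt, an image and a multi-choice QA pair. The metadata fields, prompt and question templates, and the diffusers SDXL model id are illustrative assumptions, not Dysca's actual implementation:

```python
# Minimal sketch of a metadata -> prompt -> image -> QA generation loop.
# All names below (metadata fields, templates) are hypothetical.
import random
from diffusers import StableDiffusionXLPipeline

METADATA = {                      # hypothetical metadata pool (M)
    "subject": ["a corgi", "a red sports car", "an astronaut"],
    "action": ["running", "parked", "waving"],
    "style": ["oil painting", "pixel art", "photorealistic"],   # Dysca uses 51 styles
    "background": ["on a beach", "in a snowy forest", "on the moon"],
}

def sample_prompt():
    """Compose a text-to-image prompt (P) from randomly sampled metadata (M)."""
    m = {k: random.choice(v) for k, v in METADATA.items()}
    prompt = f"{m['subject']} {m['action']} {m['background']}, {m['style']} style"
    return m, prompt

def build_multi_choice_qa(m):
    """Build a multi-choice question (Q) whose answer is known from the metadata."""
    distractors = random.sample([s for s in METADATA["style"] if s != m["style"]], 2)
    options = random.sample([m["style"]] + distractors, 3)
    question = "What is the style of this image? Options: " + ", ".join(options)
    return question, m["style"]

if __name__ == "__main__":
    # Loads SDXL in full precision; move to GPU / fp16 as needed.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0"
    )
    m, prompt = sample_prompt()
    image = pipe(prompt).images[0]          # synthesized image (I)
    question, answer = build_multi_choice_qa(m)
    image.save("sample.png")
    print(prompt, question, answer, sep="\n")
```

Because the answer is derived from the same metadata that produced the prompt, no human annotation is needed and new subtasks can be added by extending the metadata and question templates.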

Paper Structure

This paper contains 36 sections, 2 equations, 31 figures, and 11 tables.

Figures (31)

  • Figure 1: Overview of the automatic pipeline for generating vision-language QAs, cleaning them, and evaluating LVLMs. (a) We first construct prompts in terms of content, style and background, and leverage a Text-to-Image (T2I) diffusion model (e.g., SDXL [podell2023sdxl]) to synthesize the images to be asked about. Then, based on the scenario and the question type, we post-process the synthesized images and generate the corresponding textual questions, respectively. (b) We further filter out low-quality vision-language QAs using trained models to form the final Dysca. (c) Finally, we evaluate LVLMs on Dysca and report fine-grained evaluation results.
  • Figure 2: Key statistics of Dysca.
  • Figure 3: The process of generating the prompt (P), image (I) and QA pairs (Q) from metadata (M).
  • Figure 4: Failure cases in the noisy scenarios. From left to right: the corruption, adversarial-attacking and print-attacking scenarios (a toy illustration of such perturbations follows this figure list).
  • Figure 5: Models exhibit different performance when facing the same image but different question types.
  • ...and 26 more figures
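As referenced from the Figure 4 entry above, the sketch below is a rough illustration of how synthesized images could be post-processed into noisy-scenario variants. The perturbation types and parameters are assumptions, not Dysca's actual settings, and a true adversarial attack would additionally require gradient access to a target model, which is omitted here:

```python
# Illustrative post-processing of a synthesized image into "noisy" scenarios
# (cf. Figures 1 and 4). Perturbation choices and parameters are assumptions.
import numpy as np
from PIL import Image, ImageDraw

def corrupt_gaussian(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Corruption scenario: add zero-mean Gaussian pixel noise."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def print_attack(img: Image.Image, text: str = "a cat") -> Image.Image:
    """Print-attacking scenario: overlay misleading text onto the image."""
    out = img.copy()
    ImageDraw.Draw(out).text((10, 10), text, fill=(255, 0, 0))
    return out

if __name__ == "__main__":
    clean = Image.open("sample.png").convert("RGB")
    corrupt_gaussian(clean).save("sample_corruption.png")
    print_attack(clean).save("sample_print_attack.png")
```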