SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, Kaipeng Zhang

TL;DR

SridBench addresses the lack of objective evaluation for AI-driven scientific illustration by introducing a benchmark of 1,120 generation instances across 13 disciplines and six quality dimensions. It combines human expert curation with MLLM-assisted automatic scoring to assess semantic fidelity, structural integrity, and textual accuracy in generated scientific figures. Empirical results show that current models, including GPT-4o-image, lag behind expert-crafted graphics, highlighting the need for reasoning-driven improvements and better multimodal alignment. The work provides a valuable resource and framework for advancing controllable, content-aware scientific illustration generation with practical implications for research workflows.

Abstract

Recent years have seen rapid advances in AI-driven image generation. Early diffusion models emphasized perceptual quality, while newer multimodal models like GPT-4o-image integrate high-level reasoning, improving semantic understanding and structural composition. Scientific illustration generation exemplifies this evolution: unlike general image synthesis, it demands accurate interpretation of technical content and transformation of abstract ideas into clear, standardized visuals. This task is significantly more knowledge-intensive and laborious, often requiring hours of manual work and specialized tools. Automating it in a controllable, intelligent manner would provide substantial practical value. Yet, no benchmark currently exists to evaluate AI on this front. To fill this gap, we introduce SridBench, the first benchmark for scientific figure generation. It comprises 1,120 instances curated from leading scientific papers across 13 natural and computer science disciplines, collected via human experts and MLLMs. Each sample is evaluated along six dimensions, including semantic fidelity and structural accuracy. Experimental results reveal that even top-tier models like GPT-4o-image lag behind human performance, with common issues in text/visual clarity and scientific correctness. These findings highlight the need for more advanced reasoning-driven visual generation capabilities.

Paper Structure

This paper contains 15 sections and 9 figures.

Figures (9)

  • Figure 1: Overview of SridBench. We collected triplet data from 13 disciplines across natural science and computer science, and designed six evaluation metrics.
  • Figure 2: The framework of SridBench, our Benchmark of Scientific Research Illustration Drawing of Image Generation Model. Human experts set the standards for batch downloading and filtering paper data from the Internet. MLLMs and human experts then jointly screen the triplet data to ensure its authority and scientific validity. For automatic scoring, we use an MLLM whose judgments are consistent with human preferences and evaluations.
  • Figure 3: (a) Average scores of GPT-4o-image and Gemini-2.0-Flash on the six evaluation indicators, as judged by GPT-4o, for the computer science and natural science data. (b) Comparison of the scores assigned by Gemini-2.0-pro, GPT-4o, and human experts to images generated by GPT-4o-image and Gemini-2.0-Flash.
  • Figure 4: Average scores of GPT-4o-image and Gemini-2.0-Flash on the six evaluation indicators, broken down by subject within the natural science data.
  • Figure 5: Average scores of GPT-4o-image and Gemini-2.0-Flash on the six evaluation indicators, broken down by subject within the computer science data.
  • ...and 4 more figures