ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?

Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao

TL;DR

ScImage is a benchmark for evaluating how well multimodal LLMs generate accurate scientific images from textual descriptions. It evaluates five models across two output modes (direct image generation vs. code-based generation using Python or TikZ) and four input languages, with human evaluation on correctness, relevance, and scientific style. Code-based outputs, especially those from GPT-4o, generally outperform direct image outputs, but all models struggle on complex prompts that combine spatial, numeric, and attribute reasoning, and spatial understanding remains particularly hard. The work provides a structured dataset, an evaluation protocol, and baseline results that show both progress and clear gaps, underscoring the need for better world knowledge, reasoning, and robust, language-agnostic scientific image generation.

Abstract

Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images, a critical application for accelerating scientific progress, remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models (GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion) using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.

Paper Structure

This paper contains 34 sections, 7 figures, 23 tables.

Figures (7)

  • Figure 1: Illustration of scientific text-to-image generation. The text shown below is the generation query. Images on the left meet the expectations for general text-to-image tasks, while those on the right highlight the specific requirements of scientific image generation. All figures are from our ScImage experiments.
  • Figure 2: Illustration of the three understanding dimensions. The first row shows the individual dimensions of Attribute, Numeric and Spatial understanding. The second row illustrates the combination of two or three dimensions.
  • Figure 3: Comparison of text-code-image and text-image: correctness scores, averaged across model types, of each understanding category. 'Three types' means attribute, numerical and spatial understanding combined.
  • Figure 4: Generation performance of models on different object types. The same scale is used for three radar bars, with the center as correctness score 0, and the outermost circle as 5.
  • Figure 5: Incorrect output from models, arguably due to a lack of world knowledge.
  • ...and 2 more figures
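
The category structure in Figure 2 and the per-category averaging in Figure 3 can be illustrated with a short Python sketch. It assumes that every non-empty combination of the three dimensions forms one understanding category, and that correctness is rated on the 0 to 5 scale mentioned in Figure 4. The function names, model labels, and ratings are hypothetical placeholders invented for illustration. They are not the authors' code or results.

```python
from itertools import combinations
from statistics import mean

# The three understanding dimensions named in the abstract.
DIMENSIONS = ("attribute", "numeric", "spatial")


def understanding_categories(dimensions=DIMENSIONS):
    """Return every non-empty combination of dimensions, smallest first."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def average_by_category(ratings):
    """Average correctness per category over (model, category, score) rows."""
    grouped = {}
    for _model, category, score in ratings:
        grouped.setdefault(category, []).append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


categories = understanding_categories()

# Placeholder ratings, invented purely for illustration (NOT from the paper).
example_ratings = [
    ("model_a", ("spatial",), 2.0),
    ("model_b", ("spatial",), 4.0),
    ("model_a", ("attribute", "numeric", "spatial"), 1.0),
    ("model_b", ("attribute", "numeric", "spatial"), 2.0),
]
averages = average_by_category(example_ratings)

for category, score in averages.items():
    print("+".join(category), score)
```

Under this assumption the three dimensions yield seven categories: three single dimensions, three pairs, and one combination of all three (the 'Three types' group in Figure 3).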