ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
TL;DR
ScImage tackles the problem of generating accurate scientific images from textual descriptions by introducing a dedicated benchmark for multimodal LLMs. It evaluates five models across two output modalities (direct image vs. code-based generation using Python or TikZ) and four languages, with human evaluation on correctness, relevance, and scientific style. Key findings show that code-based outputs, especially via GPT-4o, generally outperform direct image outputs, but all models struggle on complex prompts that require spatial, numeric, and attribute reasoning in combination; spatial understanding remains particularly hard. The work provides a structured dataset, a rigorous evaluation protocol, and a baseline that demonstrates both progress and clear gaps, underscoring the need for improved world knowledge, reasoning, and robust, language-agnostic scientific image generation capabilities.
Abstract
Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images, a critical application for accelerating scientific progress, remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models (GPT-4o, Llama, AutomaTikZ, DALL-E, and Stable Diffusion) using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.
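To make "three dimensions as well as their combinations" concrete, the following minimal sketch enumerates every non-empty combination of the three understanding dimensions named in the abstract. The function name `dimension_combinations` and the constant `DIMENSIONS` are illustrative assumptions, not part of the benchmark's code, and the benchmark's actual prompt taxonomy may be organized differently.

```python
from itertools import combinations

# The three understanding dimensions named in the abstract.
# (The constant and function names are hypothetical, chosen for illustration.)
DIMENSIONS = ("spatial", "numeric", "attribute")


def dimension_combinations(dimensions=DIMENSIONS):
    """Return every non-empty combination of dimensions, smallest first."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


if __name__ == "__main__":
    # Prints the single dimensions first, then pairs, then the full triple.
    for combo in dimension_combinations():
        print(" + ".join(combo))
```

Under this enumeration there are three single-dimension categories, three pairwise categories, and one category that combines all three; the last group corresponds to the "more complex prompts" on which the abstract reports the most difficulty.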
