Probing the limitations of multimodal language models for chemistry and materials research

Nawaf Alampara, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, N. M. Anoop Krishnan, Kevin Maik Jablonka

TL;DR

MaCBench presents a structured, real-world multimodal benchmark for chemistry and materials science, probing data extraction, experiment understanding, and data interpretation across textual and visual inputs. The study reveals strong surface-level perception but fundamental gaps in spatial reasoning, cross-modal integration, and multi-step scientific inference, with model performance correlating to online prevalence of referenced structures. Through targeted ablations and prompt analyses, the work identifies actionable directions, such as synthetic data generation and modality-transfer training, to bolster robust multimodal reasoning. While current systems can assist in routine, well-defined tasks, they fall short of autonomous scientific reasoning, underscoring the need for advances in data curation, architectures, and evaluation for reliable AI-assisted science.

Abstract

Recent advancements in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full spectrum of scientific workflows, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms - from interpreting spectroscopic data to understanding laboratory setups. Here, we introduce MaCBench, a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation. Through a systematic evaluation of leading models, we find that while these systems show promising capabilities in basic perception tasks - achieving near-perfect performance in equipment identification and standardized data extraction - they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis, and multi-step logical inference. Our insights have important implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and approaches to training those models.

Paper Structure

This paper contains 51 sections, 1 equation, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Overview of the MaCBench framework, covering the multimodal chemistry and materials science research life cycle. The framework evaluates performance across three key domains: data extraction (teal), in silico and laboratory experiments (purple), and data interpretation (pink). The benchmark includes diverse tasks spanning tables, plots, organic chemistry diagrams, crystal structures, imaging, spectroscopy, and materials characterization. Each task requires domain-specific visual understanding and scientific reasoning, from extracting numerical values to analyzing complex experimental setups and interpreting spectroscopic data. We use icons created by Rainy Ting (on svgrepo.com).
  • Figure 2: Distribution of tasks in the MaCBench dataset. MaCBench comprises nine distinct task categories with their respective proportions, ranging from Tables & Plots (35.2%) to & analysis (1.7%). Each segment is annotated with relevant icons indicating the ablations we conducted on those tasks: modality understanding (image icon), guidance requirements (lighthouse icon), reasoning steps (lightbulb icon), and terminology complexity (book icon). The chart illustrates the benchmark's comprehensive coverage of chemistry and materials tasks.
  • Figure 3: Performance of frontier models. a. Accuracy gains compared to a random baseline across three core scientific tasks, showing the varied performance of Claude 3.5 Sonnet, GPT-4o, Gemini Pro, and Llama 3.2 90B Vision, averaged across all tasks in the three focus areas of MaCBench: data extraction, experimental understanding, and interpretation. We show the performance as the fraction of correctly answered questions relative to a random baseline. A performance of 0 means that the model is indistinguishable from random guessing. The error bars indicate the standard deviation of the fraction of correctly answered questions over five different runs. b. Radar plot demonstrating the relative model performance across ten specialized scientific domains. Again, we show the fraction of correctly answered questions relative to a random baseline (the plots without the normalization are shown in the figure labeled fig:overall-performance_unnormalized). We can observe substantial differences in performance across topics.
  • Figure 4: Ablation study results across four key dimensions of performance in chemistry and materials science tasks. a. Modality analysis compares performance between image-only and text-only inputs across different task types, with typically higher performance when the same information is shown in text form. b. Step complexity analysis demonstrates performance degradation as tasks require multiple reasoning steps. c. Terminology impact shows how scientific language specificity affects model accuracy, comparing performance with and without domain-specific terminology. We found the behavior on US Patent QA to be mostly due to the sensitivity of Gemini Pro to the prompt template (see the section on prompt fragility). d. The guidance study compares performance with and without additional task guidance, revealing model-specific sensitivity to prompting strategies. For each task, we calculated the mean score and standard deviation across five independent runs. To summarize performance across models, we averaged the mean scores and standard deviations for each task. For combined tasks (e.g., "XRD QA", "Isotherm QA", "Tables QA"), we employed a two-step averaging process: for each model, we averaged the scores and standard deviations across the sub-tasks, and we then averaged these model-specific averages across all models to obtain the final mean score and standard deviation for the combined task. For the guidance analysis, performance was measured as the mean score across five independent runs, and the variability was quantified using the standard deviation of those runs. To obtain an overall measure of performance and variability for each side (with and without guidance), we calculated the mean score and the mean standard deviation across all tasks within each side.
  • Figure 5: Model performance as a function of the number of search hits. The plots compare four leading models across different crystallographic tasks: a. atomic species identification, b. crystal system classification, c. density calculation, and d. crystal symmetry determination. For each property, the log-scale Google hit counts are plotted against the correctness of model responses, revealing correlations between answer accuracy and the prevalence of information in online sources. Higher hit counts for correct answers suggest that models may not rely solely on reasoning when answering crystal structure analysis tasks.
  • ...and 7 more figures
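
The aggregation described in the captions of Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the nested-dictionary layout, the use of the sample standard deviation, and the choice to express the gain over random guessing as a simple difference (accuracy minus chance level) are all assumptions made for illustration; the captions do not specify these details.

```python
from statistics import mean, stdev


def run_stats(scores):
    """Mean and sample standard deviation over repeated runs of one sub-task.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    return mean(scores), stdev(scores)


def combined_task_stats(runs_by_model):
    """Two-step averaging for a combined task.

    runs_by_model is a hypothetical layout: {model: {sub_task: [score per run]}}.
    Step 1: for each model, average the means and standard deviations
            across its sub-tasks.
    Step 2: average those model-specific values across all models.
    """
    per_model = []
    for sub_tasks in runs_by_model.values():
        stats = [run_stats(runs) for runs in sub_tasks.values()]
        model_mean = mean([m for m, _ in stats])
        model_std = mean([s for _, s in stats])
        per_model.append((model_mean, model_std))
    overall_mean = mean([m for m, _ in per_model])
    overall_std = mean([s for _, s in per_model])
    return overall_mean, overall_std


def gain_over_random(accuracy, n_options):
    """Accuracy relative to random guessing on an n-option question.

    Assumption: the gain is a plain difference, so 0 corresponds to
    performance indistinguishable from random guessing.
    """
    return accuracy - 1.0 / n_options


# Made-up scores for two placeholder models, five runs per sub-task.
example_runs = {
    "model_a": {"sub_1": [0.5] * 5, "sub_2": [1.0] * 5},
    "model_b": {"sub_1": [0.25] * 5, "sub_2": [0.25] * 5},
}
example_mean, example_std = combined_task_stats(example_runs)
```

Averaging standard deviations rather than pooling variances follows the caption's wording. The resulting value summarizes run-to-run variability but is not the standard deviation of the combined score.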