Seeing Beyond Words: MatVQA for Challenging Visual-Scientific Reasoning in Materials Science
Sifan Wu, Huan Zhang, Yizhan Li, Farshid Effaty, Amirreza Ataei, Bang Liu
TL;DR
The paper tackles the challenge of evaluating true multimodal reasoning in materials science by introducing MatVQA, a scalable, arXiv-derived benchmark focused on visually grounded, research-grade structure–property–performance (SPP) reasoning. It presents MArxivAgent, an automated pipeline that generates 1325 multiple-choice questions (MCQs) across four SPP tasks and iteratively removes language and caption shortcuts so that answering requires fine-grained visual analysis. Benchmark results across 17 MLLMs reveal persistent gaps in visual grounding and multi-hop reasoning, with larger models offering only limited, task-dependent gains. The work demonstrates the potential of scalable, domain-specific benchmarking and suggests future expansion to ~12,000 questions and extensions to 3D crystal structures to drive progress in materials discovery.
Abstract
The emergence of Multimodal Large Language Models (MLLMs) that integrate vision and language modalities has unlocked new potential for scientific reasoning, surpassing prior results on benchmarks in both natural language and coding domains. Current materials science evaluation datasets such as MaScQA and SciQA remain largely text-based and fail to capture the visual and research-level analytic complexity required in materials discovery and design. We introduce MatVQA, a scalable benchmark specifically designed to address this gap. Generated from recent materials literature via an automated pipeline, MArxivAgent, MatVQA features 1325 questions across four critical structure-property-performance (SPP) reasoning tasks. Uniquely, MatVQA employs an iterative process to eliminate textual shortcuts, compelling MLLMs to perform fine-grained, low-level visual analysis of material imagery (e.g., microscopy, diffraction patterns) integrated with multi-step scientific reasoning. Benchmarking 17 open- and closed-source MLLMs on MatVQA reveals substantial gaps in current multimodal reasoning capabilities. The MatVQA benchmark data and evaluation code are publicly available at https://anonymous.4open.science/r/matvqa-1E01 to catalyze further research in applying MLLMs to complex materials science problems.
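As an illustration of how a multiple-choice benchmark of this kind is typically scored, the following minimal sketch computes per-task accuracy from model predictions. It is not the released evaluation code: the record schema (`task`, `answer`, `prediction`) and the task labels `task_a` and `task_b` are hypothetical placeholders, since the four SPP tasks are not named here.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for multiple-choice predictions.

    `records` is an iterable of dicts with the keys 'task', 'answer',
    and 'prediction' (a hypothetical schema chosen for this sketch).
    Option letters are compared case-insensitively after stripping
    surrounding whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records with placeholder task names (assumptions, not benchmark data).
records = [
    {"task": "task_a", "answer": "B", "prediction": "b"},
    {"task": "task_a", "answer": "C", "prediction": "A"},
    {"task": "task_b", "answer": "D", "prediction": "D "},
]
scores = per_task_accuracy(records)
for task, accuracy in sorted(scores.items()):
    print(f"{task}: {accuracy:.2f}")
```

Reporting accuracy per task rather than as one pooled number keeps task-dependent differences between models visible.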
