Seeing Beyond Words: MatVQA for Challenging Visual-Scientific Reasoning in Materials Science

Sifan Wu, Huan Zhang, Yizhan Li, Farshid Effaty, Amirreza Ataei, Bang Liu

TL;DR

The paper tackles the challenge of evaluating true multimodal reasoning in materials science by introducing MatVQA, a scalable, arXiv-derived benchmark focused on visually grounded, research-grade structure–property–performance (SPP) reasoning. It presents MArxivAgent, an automated pipeline that generates 1325 multiple-choice questions (MCQs) across four SPP tasks and iteratively removes language and caption shortcuts so that answering requires fine-grained visual analysis. Benchmark results across 17 MLLMs reveal persistent gaps in visual grounding and multi-hop reasoning, with larger models offering only limited, task-dependent gains. The work demonstrates the potential of scalable, domain-specific benchmarking and suggests future expansion to ~12,000 questions and extensions to 3D crystal structures to drive progress in materials discovery.

Abstract

The emergence of Multimodal Large Language Models (MLLMs) that integrate vision and language modalities has unlocked new potential for scientific reasoning, surpassing prior results on benchmarks in both natural language and coding domains. Current materials science evaluation datasets such as MaScQA and SciQA remain largely text-based and fail to capture the visual and research-level analytic complexity required in materials discovery and design. We introduce MatVQA, a scalable benchmark specifically designed to address this gap. Generated via an automated pipeline, MArxivAgent, from recent materials literature, MatVQA features 1325 questions across four critical structure-property-performance (SPP) reasoning tasks. Uniquely, MatVQA employs an iterative process to eliminate textual shortcuts, compelling MLLMs to perform fine-grained, low-level visual analysis of material imagery (e.g., microscopy, diffraction patterns) integrated with multi-step scientific reasoning. Benchmarking 17 open- and closed-source MLLMs on MatVQA reveals substantial gaps in current multimodal reasoning capabilities. The MatVQA benchmark data and evaluation code are publicly available at https://anonymous.4open.science/r/matvqa-1E01 to catalyze further research in applying MLLMs to complex materials science problems.

Paper Structure

This paper contains 19 sections, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Domain attribution for MatVQA
  • Figure 2: Construction Pipeline of MatVQA
  • Figure 3: MArxivAgent pipeline for automatic MCQ generation. "Lan. Rem" denotes the question after language shortcut removal. "Cap. Rem" denotes the question after caption shortcut removal.
  • Figure 4: Evolution of a sample question through the two-stage shortcut removal process. The figure shows the transformation from the initial 'Raw Sample', to the version after 'Language Shortcut removal', and finally to the version after 'Caption Shortcut removal'.
  • Figure 5: Representative examples for various materials science domains.
  • ...and 5 more figures