Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis
Yingming Pu, Tao Lin, Hongyu Chen
TL;DR
MatterMech is a comprehensive benchmark for evaluating LLM-generated physicochemical hypotheses in eight nanomaterial-synthesis subdomains, linking a large literature corpus to a parallel knowledge base of principles. The framework combines multiple-choice question (MCQ) and text-infilling tasks, enhanced by a multi-agent validation process and human evaluation, to assess grounding in physicochemical principles and reasoning quality. Key findings show that principle-guided prompting improves accuracy and efficiency over standard Chain-of-Thought, that larger models generally perform better but can struggle with domain-specific terminology, and that scaling laws and output patterns reveal trade-offs between depth of reasoning and concise expression. The work provides a practical resource (MatterMech and MatterDB) and concrete directions for advancing reliable, principle-based hypothesis generation in materials science.
Abstract
The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks that probe physicochemical reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, enhancing both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code are publicly available on GitHub at https://github.com/amair-lab/MatterMech.
