Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis

Yingming Pu, Tao Lin, Hongyu Chen

TL;DR

MatterMech introduces a comprehensive benchmark to evaluate LLM-driven physicochemical hypotheses in eight nanomaterial-synthesis subdomains, linking a large literature corpus to a parallel knowledge base of principles. The framework combines MCQ and text-infilling tasks, enhanced by a multi-agent validation process and human evaluation, to assess grounding in physicochemical principles and reasoning quality. Key findings show that principle-guided prompting improves accuracy and efficiency over standard Chain-of-Thought, that larger models generally perform better but can struggle with domain-specific terminology, and that scaling laws and output patterns reveal trade-offs between depth of reasoning and concise expression. The work provides a practical resource (MatterMech and MatterDB) and concrete directions for advancing reliable, principle-based hypothesis generation in materials science.

Abstract

The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks that probe physicochemical reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, enhancing both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code are publicly available on GitHub at https://github.com/amair-lab/MatterMech.

Paper Structure

This paper contains 43 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An overview of MatterMech. Based on documents sampled from the article corpus, we initially generate questions with a language model (e.g., Claude-3.5-Sonnet), then refine the options through a multi-agent roundtable system to improve consistency. A final human check establishes consensus on the created benchmark questions.
  • Figure 2: Manually validated quality of created MCQs with a 5-point Likert scale. The average scores for three metrics are all around 4.0, indicating a well-aligned consensus with humans.
  • Figure 3: Comparative performance across subdomains. Each value is the average over all tested models.
  • Figure 4: Scaling law of model size and performance. We compare TIF and MCQ performance across the Qwen2.5 family.
  • Figure 5: Completion length distribution of reasoning models and their base models. The x-axis is the number of output tokens; the y-axis is the density.
  • ...and 1 more figures