MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature

Hirofumi Tsuruta, Masaya Kumagai

TL;DR

The paper presents a PROV-DM-based framework for representing material synthesis procedures as provenance graphs, which can capture branching and merging steps that linear operation sequences cannot. It releases MatPROV, a PROV-DM-compliant dataset of 2,367 procedures from 1,568 open-access papers, serialized in PROV-JSONLD with ten synthesis-parameter attributes. The authors validate LLM-based extraction against expert ground truth, showing that advanced models can produce coherent DAGs with meaningful structure and parameters, although results vary between runs and are sensitive to prompting. The work advances machine-interpretable synthesis knowledge with potential use in automated synthesis planning and optimization, while noting biases toward certain material classes and the need for broader, more rigorous evaluation. Overall, MatPROV demonstrates both the promise and the current limitations of graph-based extraction of complex procedural knowledge from scientific literature.

Abstract

Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.
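
To make the graph-based representation concrete, the following is a minimal sketch of a synthesis procedure expressed with two core PROV-DM relations, `used` (an activity consumes an entity) and `wasGeneratedBy` (an entity is produced by an activity). The procedure, the `ex:` identifiers, and the `ex:temperature` attribute are hypothetical illustrations and are not taken from the dataset or its PROV-JSONLD schema. The sketch checks that the relations form a directed acyclic graph by computing a topological order with the Python standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy procedure: two precursors are mixed, then the mixture is sintered.
entities = {"ex:precursor_A", "ex:precursor_B", "ex:mixture", "ex:product"}
activities = {
    "ex:mixing": {},
    # "ex:temperature" is an illustrative attribute name, not the dataset's schema.
    "ex:sintering": {"ex:temperature": "1000 C"},
}

# PROV-DM relations as (subject, object) pairs.
used = [  # activity used entity
    ("ex:mixing", "ex:precursor_A"),
    ("ex:mixing", "ex:precursor_B"),
    ("ex:sintering", "ex:mixture"),
]
was_generated_by = [  # entity wasGeneratedBy activity
    ("ex:mixture", "ex:mixing"),
    ("ex:product", "ex:sintering"),
]


def dependency_map(used, was_generated_by):
    """Map each node to the set of nodes that must exist or happen before it."""
    deps = {}
    for activity, entity in used:
        deps.setdefault(activity, set()).add(entity)
        deps.setdefault(entity, set())
    for entity, activity in was_generated_by:
        deps.setdefault(entity, set()).add(activity)
        deps.setdefault(activity, set())
    return deps


def execution_order(used, was_generated_by):
    """Return a topological order; raises graphlib.CycleError if the graph has a cycle."""
    sorter = TopologicalSorter(dependency_map(used, was_generated_by))
    return list(sorter.static_order())


def undeclared_nodes(used, was_generated_by, entities, activities):
    """Return nodes that appear in a relation but were never declared."""
    declared = set(entities) | set(activities)
    return set(dependency_map(used, was_generated_by)) - declared


order = execution_order(used, was_generated_by)
```

Because the mixing step has two inputs, this small example already has a merge that a strictly linear sequence of operations could not express directly.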

Paper Structure

This paper contains 32 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the dataset construction pipeline and its data representation. The example shows a simplified procedure; actual cases typically involve more complex, multi-step processes.
  • Figure 2: (a) Histogram of node counts per synthesis procedure graph in MatPROV. Synthesis backbones for (b) thermoelectric and (c) magnetic materials, represented by green nodes and edges. Edge weights represent the co-occurrence frequencies. See the dataset analysis appendix for details.
  • Figure 3: Representative examples of synthesis procedure graphs extracted using o4-mini from the papers with DOI (a) 10.1002/advs.201600035 and (b) 10.1155/2015/854840.
  • Figure 4: (a) Distribution of synthesis procedures in MatPROV by material type. (b) Periodic table visualization of the elemental frequency in materials included in MatPROV.
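
The caption of Figure 2 mentions edge weights given by co-occurrence frequencies. The exact definition is in the paper's appendix and is not reproduced here. As one plausible reading, the sketch below counts how many procedures contain each directed operation-to-operation edge. The operation names and the `edge_weights` helper are hypothetical, and the paper's actual backbone construction may differ.

```python
from collections import Counter

# Hypothetical procedures, each reduced to its directed operation-to-operation edges.
procedures = [
    [("weighing", "mixing"), ("mixing", "sintering")],
    [("weighing", "mixing"), ("mixing", "annealing")],
    [("weighing", "mixing"), ("mixing", "sintering"), ("sintering", "annealing")],
]


def edge_weights(procedures):
    """Count, for each directed edge, the number of procedures that contain it."""
    counts = Counter()
    for edges in procedures:
        counts.update(set(edges))  # set(): count an edge at most once per procedure
    return counts


weights = edge_weights(procedures)
# The most frequent edges are candidates for a shared "backbone" in this toy setting.
backbone_candidates = [edge for edge, n in weights.items() if n >= 2]
```

Counting each edge once per procedure keeps a single long procedure with repeated steps from dominating the weights; counting raw occurrences instead would be an equally simple variant.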