Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

Xin Wang, Anshu Raj, Matthew Luebbe, Haiming Wen, Shuozhi Xu, Kun Lu

TL;DR

This work addresses the extraction of comprehensive composition–processing–microstructure–property information from unstructured materials literature by introducing a source-tracked, four-stage information extraction pipeline powered by large language models. The pipeline first identifies materials, then iteratively extracts microstructure and mechanical properties, and finally validates the results, keeping explicit links to the source text for traceability. Across 100 articles and 396 materials, the approach achieves F1 scores of about 0.96, lowers the rate of missed materials from 12.4% to 3.3% compared with single-pass extraction without source tracking, and produces zero false-positive materials, supporting reliable, scalable construction of materials knowledge bases. The method still shows limitations in domain-specific reasoning and multimodal data handling, which points to future work on broader material classes, multimodal extraction, and knowledge-graph integration.

Abstract

Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, which makes it difficult to build comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores around 0.96. Compared with single-pass extraction without source tracking, our approach improved the F1 scores for the microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
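
The miss rates quoted in the abstract follow directly from the reported counts (49 and 13 missed materials out of 396). The short Python check below reproduces that arithmetic; the function name `miss_rate` is an illustrative choice, and the relative reduction it prints is derived here from the reported counts rather than stated in the paper.

```python
# Counts taken from the abstract; everything else is illustrative.
TOTAL_MATERIALS = 396        # materials across the 100 evaluated articles
MISSED_SINGLE_PASS = 49      # single-pass extraction without source tracking
MISSED_MULTI_STAGE = 13      # multi-stage extraction with source tracking


def miss_rate(missed: int, total: int = TOTAL_MATERIALS) -> float:
    """Fraction of ground-truth materials that were not extracted."""
    return missed / total


single_pass = miss_rate(MISSED_SINGLE_PASS)
multi_stage = miss_rate(MISSED_MULTI_STAGE)

print(f"single-pass miss rate: {single_pass:.1%}")   # 12.4%
print(f"multi-stage miss rate: {multi_stage:.1%}")   # 3.3%

# Derived quantity (not reported in the paper): share of misses eliminated.
relative_reduction = (MISSED_SINGLE_PASS - MISSED_MULTI_STAGE) / MISSED_SINGLE_PASS
print(f"relative reduction in missed materials: {relative_reduction:.1%}")  # 73.5%
```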

Paper Structure

This paper contains 20 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Extraction framework showing the 47 features extracted across four categories (composition, processing, microstructure, and properties) and the specialized prompt design for each category. Here, "NRT-Cryo" means non-room temperature cryogenic conditions.
  • Figure 2: Architecture of our multi-stage extraction pipelines (with/without source tracking).
  • Figure 3: Processes of three extraction pipelines.
  • Figure 4: Performance comparison of three extraction pipelines across 47 features organized into composition, processing, microstructure, and properties categories. (a) Feature-level F1 score comparison showing individual feature extraction accuracy for each pipeline across the four categories. (b) Feature-level accuracy comparison demonstrating the proportion of correctly extracted features within each category. (c) Tuple-level F1 score comparison evaluating the extraction of interdependent feature groups, applied specifically to processing and microstructure categories where features exhibit strong relationships. (d) Tuple-level accuracy comparison measuring the correct extraction of complete feature tuples.
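
The difference between feature-level and tuple-level scoring in Figure 4 can be illustrated with a small sketch. This is not the authors' evaluation code: it assumes exact string matching, positional alignment of predicted and reference records, and hypothetical field names (`step`, `temperature`, `time`) and values. At the feature level each attribute counts on its own; at the tuple level a group of interdependent features counts only if every member is correct.

```python
from typing import Dict, Hashable, List, Set

Record = Dict[str, str]  # one group of interdependent features, e.g. a processing step


def f1_from_sets(pred: Set[Hashable], gold: Set[Hashable]) -> float:
    """F1 from set overlap: true positives are items present in both sets."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def feature_level_f1(pred: List[Record], gold: List[Record]) -> float:
    """Score every (record index, feature, value) item independently."""
    pred_items = {(i, k, v) for i, rec in enumerate(pred) for k, v in rec.items()}
    gold_items = {(i, k, v) for i, rec in enumerate(gold) for k, v in rec.items()}
    return f1_from_sets(pred_items, gold_items)


def tuple_level_f1(pred: List[Record], gold: List[Record]) -> float:
    """Score each record as a whole: it counts only if all features match."""
    pred_tuples = {tuple(sorted(rec.items())) for rec in pred}
    gold_tuples = {tuple(sorted(rec.items())) for rec in gold}
    return f1_from_sets(pred_tuples, gold_tuples)


# Hypothetical example: one wrong value (aging time) in an otherwise correct extraction.
gold = [
    {"step": "homogenization", "temperature": "1200 C", "time": "2 h"},
    {"step": "aging", "temperature": "800 C", "time": "24 h"},
]
pred = [
    {"step": "homogenization", "temperature": "1200 C", "time": "2 h"},
    {"step": "aging", "temperature": "800 C", "time": "2 h"},
]

print(f"feature-level F1: {feature_level_f1(pred, gold):.3f}")  # 0.833
print(f"tuple-level F1:   {tuple_level_f1(pred, gold):.3f}")    # 0.500
```

A single wrong value costs one of six items at the feature level but invalidates a whole record at the tuple level, so the tuple-level score is the stricter of the two in this sketch.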