Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Luca Foppiano; Guillaume Lambard; Toshiyuki Amagasa; Masashi Ishii

TL;DR

This study benchmarks large language models (GPT-3.5-Turbo, GPT-4, GPT-4-Turbo) against BERT-based baselines for information extraction (IE) in materials science, focusing on named entity recognition (NER) of materials and properties and relation extraction (RE) between them. A novel formula-matching approach normalizes material expressions so that predictions can be compared meaningfully across surface variations. Results show that LLMs struggle to outperform specialized models on NER of materials and properties, although the GPT-4 family demonstrates strong RE capabilities without fine-tuning, and GPT-3.5-Turbo, when fine-tuned with an appropriate strategy for RE, outperforms all other models; for materials NER, small, domain-specialized models currently perform best. The paper offers a reproducible evaluation framework and practical guidance for applying LLMs to IE tasks in materials science sub-domains, with potential extensions to other areas of Materials Informatics (MI).

Abstract

This study assesses the capabilities of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting structured information from scientific documents in materials science. To this end, we focus on two critical information extraction tasks: (i) named entity recognition (NER) of studied materials and physical properties and (ii) relation extraction (RE) between these entities. Due to the evident lack of datasets within Materials Informatics (MI), we evaluated using SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of LLMs on these tasks is benchmarked against traditional models based on the BERT architecture and against rule-based approaches (the baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. However, GPT-3.5-Turbo fine-tuned with the appropriate strategy for RE outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relation extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks that require extracting complex domain-specific entities such as materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.
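
The idea of standardising chemical formulas before comparing them can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `parse_formula` and `formulas_match`, the regular expression, and the tolerance `tol` are assumptions made for illustration, and only flat formulas (no parentheses, variables, or substitution notation) are handled. The sketch shows how two formulas written with a different element order can still be judged equivalent when compared element by element.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: an element symbol followed by an optional amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula: str) -> dict[str, float]:
    """Parse a flat formula such as 'La1.85Sr0.15CuO4' into {element: amount}."""
    composition: defaultdict[str, float] = defaultdict(float)
    for element, amount in _TOKEN.findall(formula):
        # A missing amount means one atom; repeated elements are summed.
        composition[element] += float(amount) if amount else 1.0
    return dict(composition)


def formulas_match(a: str, b: str, tol: float = 1e-6) -> bool:
    """Compare two formulas element by element, ignoring element order."""
    comp_a, comp_b = parse_formula(a), parse_formula(b)
    if not comp_a or comp_a.keys() != comp_b.keys():
        return False
    return all(abs(comp_a[el] - comp_b[el]) <= tol for el in comp_a)


if __name__ == "__main__":
    print(formulas_match("La1.85Sr0.15CuO4", "CuO4Sr0.15La1.85"))
    print(formulas_match("MgB2", "MgB3"))
```

Comparing parsed compositions rather than raw strings is what lets surface variations of the same material count as a match; a production parser would also need to resolve material names, doping expressions, and variables.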

Paper Structure

This paper contains 30 sections, 6 figures, and 33 tables.

Introduction
Method
Named Entities Recognition
Output format
Formula matching
Relation Extraction
Shuffled vs non-shuffled evaluation
Consideration about the fine-tuning
Results and discussions
Limitation of this study
Formula matching
NER on properties extraction
NER on materials expressions extraction
Relation extraction
Data variability for fine-tuning
...and 15 more sections

Figures (6)

Figure 1: Two materials that appear to have a very different composition are, in reality, overlapping. (Top) Summary of the Material Parser. More information is available in lfoppiano2023automatic. (Bottom) The pairwise comparison of each chemical formula is performed element-by-element.
Figure 2: Comparison scores for properties extraction using NER. The scores are the aggregations of the micro average F1 scores and are calculated using soft matching with a threshold of 0.9 similarity. The error bars are calculated over the standard deviation of three independent runs.
Figure 3: Comparison scores for material extraction using NER. The metrics are the aggregations of the micro average F1-scores, calculated using formula matching. The error bars are calculated over the standard deviation of three independent runs.
Figure 4: Comparison of the scores of the shuffled extraction using zero-shot prompting, few-shot prompting and the fine-tuned model for RE on materials and properties. The metrics are the aggregated micro average F1-scores calculated using strict matching. The error bars are calculated over the standard deviation of three independent runs.
Figure 5: Overview evaluation on shuffling the provided entities in RE on materials and properties. The metrics are the aggregated micro average F1-scores calculated using strict matching. The error bars are calculated over the standard deviation of three independent runs.
...and 1 more figure
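
The captions above report micro-average F1 scores computed with soft matching at a 0.9 similarity threshold. The following is a minimal sketch of how such a score could be computed, not the evaluation code used in the paper. The similarity measure (`difflib.SequenceMatcher` ratio), the case-insensitive comparison, the greedy one-to-one alignment, and the function names `soft_match` and `micro_f1` are all assumptions.

```python
from difflib import SequenceMatcher


def soft_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """Return True when the two strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, pred.lower(), gold.lower()).ratio()
    return ratio >= threshold


def micro_f1(documents, threshold: float = 0.9) -> float:
    """Micro-average F1 over (predicted, expected) entity lists per document.

    Counts are pooled over all documents before computing precision and
    recall, which is what makes the average 'micro'.
    """
    tp = fp = fn = 0
    for predicted, expected in documents:
        remaining = list(expected)
        for pred in predicted:
            # Greedy alignment: take the first unmatched expected entity.
            hit = next((g for g in remaining if soft_match(pred, g, threshold)), None)
            if hit is None:
                fp += 1
            else:
                tp += 1
                remaining.remove(hit)
        fn += len(remaining)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    docs = [
        (["critical temperature", "pressure"], ["critical temperature", "magnetic field"]),
        (["Tc"], ["Tc"]),
    ]
    print(round(micro_f1(docs), 3))
```

Strict matching, as used for the RE figures, corresponds to replacing `soft_match` with exact string equality.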
