From Text to Insight: Large Language Models for Materials Science Data Extraction

Mara Schilling-Wilhelmi; Martiño Ríos-García; Sherjeel Shabih; María Victoria Gil; Santiago Miret; Christoph T. Koch; José A. Márquez; Kevin Maik Jablonka

TL;DR

This work surveys the use of large language models (LLMs) to extract structured data from unstructured materials-science text, emphasizing end-to-end workflows that connect data collection, preprocessing, and validation with domain knowledge. It articulates practical strategies for prompting, fine-tuning, and agentic approaches, including multimodal and retrieval-augmented methods, to overcome context-length and verification challenges. The authors propose frameworks for evaluation, data normalization, and knowledge-grounded validation, and highlight frontiers such as cross-document linking, multimodal integration, and bias mitigation. The review aims to accelerate data-driven materials discovery by providing actionable guidance and pointing to benchmarks and open questions for the field.

Abstract

The vast majority of materials science knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling efficient extraction of structured, actionable data from unstructured text by non-experts. While applying LLMs to materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This review provides a comprehensive overview of LLM-based structured data extraction in materials science, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and materials science expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven materials research. The insights presented here could significantly enhance how researchers across disciplines access and utilize scientific information, potentially accelerating the development of novel materials for critical societal needs.

Paper Structure

This paper contains 67 sections, 3 equations, 10 figures, and 3 tables.

Introduction
Overview of the working principles of LLMs
Sampling outputs
Embeddings
Training and tuning of LLMs
LLM-systems
Structured data extraction workflow
Preprocessing
Obtaining data
Tools for data mining
Importance of structured data
Curating and cleaning data
Document parsing and understanding
Document cleaning
Dealing with finite context
...and 52 more sections

Figures (10)

Figure 1: Number of research papers vs. datasets deposited in data repositories in materials science and chemistry per year. The top graph shows the number of publications from 1996 to 2023. The number of records was obtained from the search queries "(nanoparticles)", "(battery AND cathode AND materials)", "(photocatalytic AND materials)", "(polymers)", "(thermoelectric AND materials)", "((metal-organic AND framework) OR MOF)", "(biomaterials)", "((2D AND materials) OR graphene)", and "(semiconductor AND materials)" in the Web of Science Core Collection on July 1, 2024 (search based on title, abstract and indexing, including "Article" and "Data Paper" document types) (categories based on Kononova_2021) (see https://matextract.pub/content/intro_figure/figure1_intro_notebook.html). The two graphs below show the number of datasets in chemistry and materials science deposited in the Zenodo and DataCite repositories from 1996 to 2023. The number of records was obtained from similar queries by restricting the document type to "Dataset". Note the different $y$-axis scale between the top and bottom graphs. While this figure highlights the large difference in the availability of structured datasets compared to papers, we note that a one-to-one comparison of these numbers is not always fair. This is because sometimes multiple papers can be used to create a single dataset, as in the case of curated databases, or vice versa, where multiple structured datasets can result from a single paper's work.
Figure 2: High-level explanation of the working principle of an LLM. The data flow in the image corresponds to a decoder-only model, e.g., a GPT or Llama model. One token is produced each time, considering all the tokens from the input and all previously produced tokens. The process starts with the tokenizer, which converts the user query into smaller constituent units, the tokens. The tokens are passed into the model, where input embeddings are computed, after which additional operations transform the embeddings. As a result, the model outputs the probabilities over possible subsequent tokens. Depending on the temperature parameter, the most probable token or a less probable one is chosen, and the sampled token is added to the input. By repeating this process, the model generates the response to the query.
Figure 3: Data extraction workflow. This figure illustrates the flow of data from left to right through various stages of the extraction process. The evaluation loop includes all steps in the workflow, indicating that if evaluations do not yield satisfactory results, corrections and improvements may be necessary at any stage. It is important to conduct these evaluations using a representative and labeled test set, rather than the entire unstructured data corpus. Once the evaluations demonstrate satisfactory results, the entire corpus of unstructured data can be processed.
Figure 4: Data preprocessing workflow. The process from the mined articles to a machine-readable and cleaned format, which one could send to an LLM. For articles from which the relevant information cannot readily be extracted using conventional tools, multimodal models might be a suitable alternative (see the section on multimodal models).
Figure 5: Decision tree to help decide what chunking strategy to use. If the input text is quite short, no chunking is required. In contrast, if the information is spread across a very large corpus, retrieval-augmented generation (RAG) can provide cost and efficiency benefits. Chunking is typically applied with RAG. In this case, one can use semantic chunking if it provides chunks that fit into the context window. The simplest option is to chunk text using a fixed window size.
...and 5 more figures
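
The token-by-token generation described in the Figure 2 caption can be illustrated with a small sketch. It is not taken from the paper: the function names, the toy `model_fn` callable (a stand-in for a real decoder-only model that returns one score per vocabulary entry), and the convention of treating temperature 0 as greedy selection are all assumptions made for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick the index of the next token.

    Assumption: temperature 0 means "always take the most probable token".
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def generate(model_fn, prompt_tokens, max_new_tokens, temperature, rng):
    """Repeatedly sample one token and append it to the input.

    `model_fn` is a hypothetical callable mapping the token list seen so far
    to a list of scores over the vocabulary.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model_fn(tokens)
        tokens.append(sample_token(logits, temperature, rng))
    return tokens
```

With a higher temperature, `sample_token` picks less probable tokens more often, which is the trade-off the caption refers to; a real system would also stop at an end-of-sequence token, which this sketch omits.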
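
The simplest branch of the decision tree in the Figure 5 caption, fixed-window chunking, might look like the sketch below. It assumes the text has already been split into tokens (any list works here); the function name and the optional `overlap` parameter are illustrative additions, not something the caption specifies.

```python
def chunk_fixed_window(tokens, window_size, overlap=0):
    """Split a token list into fixed-size windows.

    Short inputs that already fit into one window are returned unchunked,
    mirroring the "no chunking required" branch of the caption.
    `overlap` (an assumption, not from the source) repeats that many
    tokens at the start of each following chunk.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    if len(tokens) <= window_size:
        return [list(tokens)]

    step = window_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_fixed_window("a b c d e".split(), window_size=2)` yields three chunks, the last holding only the single remaining token. Semantic chunking would replace the fixed `step` with boundaries such as paragraphs or sections.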
