Structured information extraction from complex scientific text with fine-tuned large language models

Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain

TL;DR

This work introduces a straightforward prompt–completion pipeline that fine-tunes GPT-3 to perform joint named entity recognition and relation extraction (NERRE) on complex scientific text, yielding either simple English sentences or structured JSON outputs. It applies the approach to three materials science tasks (solid-state doping, MOF cataloging, and general materials information extraction) and reports strong information-extraction performance. It also describes a human-in-the-loop annotation workflow in which a partially trained model pre-annotates text for manual correction, which speeds up the preparation of training data. Although exact-match sequence reconstruction can be challenging for longer, more complex outputs, the method achieves high parsing rates and competitive or superior NERRE performance compared with baselines, and manual evaluation shows strong domain-appropriate results. The approach is presented as accessible, flexible, and readily transferable to other domains, with an online demo and the potential for rapid assembly of large, structured knowledge graphs from unstructured literature.

Abstract

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
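
As a rough illustration of the prompt/completion pairs mentioned in the abstract, the sketch below serializes one annotated sentence as a JSON Lines record with `prompt` and `completion` keys. The separator strings, the `host`/`dopants` keys, and the example sentence are assumptions made for illustration; the text above does not specify them.

```python
import json

# Hypothetical markers: the text above does not say which separator
# or stop strings were used, so these are placeholders.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def make_record(text, entries):
    """Build one prompt/completion training record from a passage and its annotations."""
    return {
        "prompt": text.strip() + PROMPT_END,
        # The completion is the annotation serialized as a JSON string.
        "completion": " " + json.dumps(entries) + COMPLETION_END,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, one training pair per line."""
    return "\n".join(json.dumps(record) for record in records)


# Invented sentence and hypothetical schema keys, for illustration only.
example = make_record(
    "ZnO nanowires were doped with Al.",
    [{"host": "ZnO", "dopants": ["Al"]}],
)
jsonl = to_jsonl([example])
```

Roughly 500 such lines would make up a training file of the size described above. Keeping the completion as a JSON string means the model output can later be parsed without a task-specific decoder.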

Paper Structure (8 sections, 6 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Simplified comparison of previous relation extraction (RE) models and the seq2seq LLM approach proposed in this work.
  • Figure 2: Overview of our sequence-to-sequence approach to the document-level joint named entity recognition and relation extraction task. In the first step, lists of JSON documents are prepared from abstracts according to a predefined schema, and a GPT-3 model is trained. In the second step, this preliminary (intermediate) model is used to accelerate the preparation of additional training data: the partially trained model pre-annotates new text, and the annotations are corrected manually. This step may be repeated multiple times, with each subsequent partially fine-tuned model improving in performance. In the final step, GPT-3 is fine-tuned on the complete dataset and used for inference to extract the desired information from new text.
  • Figure 3: Annotation schema example for the doping extraction task. A raw sentence text sequence prompt $p$ is passed into an LLM-NERRE doping model, which produces a structured sequence completion $c$. The structured completion format depends on the model; here, we train two separate models, Doping-ENG and Doping-JSON, with structured completion formats of English sentences and JSON, respectively. These models are completely independent for training and evaluation purposes. As an optional final step, the completions may be decoded and post-processed from string literals into hierarchical structured graph objects ($G$) for further analysis. Note that this final step is separate from the NERRE models themselves, as the graph objects are decoded programmatically to a variety of formats (e.g., JSON, NetworkX objects); a more complex hierarchical graph example for the doping task is shown in a supplementary figure.
  • Figure 4: Annotation schema example for the general materials-chemistry extraction task. A raw full-abstract text prompt $p$ is passed to the General-JSON model, which produces a structured completion $c$ that follows a JSON schema. The schema is a list of individual material entries ordered by appearance in the text, each of which may have a name, formula, acronym, descriptors, applications, and/or phase label. The structured completion may then be programmatically decoded to a hierarchical materials graph $G$ without an ML model.
  • Figure 5: Annotation schema example for the metal-organic frameworks extraction task. A raw full-abstract text prompt $p$ is fed into the MOF-JSON model, which produces a structured output sequence $c$ similar to that of the General-JSON model shown in Fig. 4. The output sequence is a formatted string literal which can be directly loaded as JSON. The string may then be optionally decoded to a graph $G$ for further analysis. In this example, only the MOF name and application were extracted from the passage, and both MOFs (LaBTB and ZrPDA) are linked to both applications (luminescent and VOC sensor).
  • ...and 4 more figures
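
The captions for Figures 3 to 5 describe decoding a completion string into a graph without any ML model. The following is a minimal sketch of that step, assuming a list-of-entries JSON layout like the one described for Figure 4. The exact key names (`name`, `formula`, `acronym`, `phase`, `descriptors`, `applications`), the `material_<i>` node labels, and the edge-triple representation are hypothetical choices for illustration, not the authors' implementation.

```python
import json

# Hypothetical key names modeled on the fields listed in the Figure 4 caption.
SCALAR_FIELDS = ("name", "formula", "acronym", "phase")
LIST_FIELDS = ("descriptors", "applications")


def decode_completion(completion):
    """Parse a completion string; return a list of entries, or None if it is not a JSON list."""
    try:
        entries = json.loads(completion)
    except json.JSONDecodeError:
        return None
    return entries if isinstance(entries, list) else None


def to_edges(entries):
    """Turn material entries into (node, relation, value) triples."""
    edges = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            continue  # skip malformed entries rather than fail
        node = f"material_{index}"
        for field in SCALAR_FIELDS:
            value = entry.get(field)
            if value:
                edges.append((node, field, value))
        for field in LIST_FIELDS:
            for value in entry.get(field) or []:
                edges.append((node, field, value))
    return edges


# Completion mirroring the Figure 5 example: two MOFs, each linked to both applications.
completion = (
    '[{"name": "LaBTB", "applications": ["luminescent", "VOC sensor"]},'
    ' {"name": "ZrPDA", "applications": ["luminescent", "VOC sensor"]}]'
)
edges = to_edges(decode_completion(completion))
```

Each triple could be passed to a graph library such as NetworkX as an edge with a relation attribute. Returning `None` for unparsable completions also gives a simple way to count how many model outputs parse as valid JSON.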