Structured information extraction from complex scientific text with fine-tuned large language models
Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain
TL;DR
This work introduces a straightforward prompt–completion pipeline that fine-tunes GPT-3 to perform joint named entity recognition and relation extraction (NERRE) on complex scientific text, returning either simple English sentences or structured JSON output. The approach is applied to three materials science tasks (solid-state doping, MOF cataloging, and general materials information extraction), where it shows strong information-extraction performance, and a human-in-the-loop annotation workflow offers substantial practical benefits. Although exact-match sequence reconstruction can be challenging for longer, more complex outputs, the method achieves high parsing rates and NERRE performance that is competitive with or superior to the baselines, and manual evaluation shows strong, domain-appropriate results. The approach is presented as accessible, flexible, and readily transferable to other domains, with an online demo and the potential for rapid assembly of large, structured knowledge graphs from unstructured literature.
Abstract
Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or in a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
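To make the prompt/completion idea concrete, the following is a minimal sketch of how one training pair for the dopant-host task might be serialized and how a model completion might be parsed back into records. It is an illustration under stated assumptions, not the authors' implementation: the separator strings `PROMPT_SUFFIX` and `COMPLETION_STOP`, the record keys `host` and `dopants`, the example sentence, and both function names are hypothetical and do not come from the paper.

```python
import json

# Hypothetical separator and stop strings; the abstract does not specify them.
PROMPT_SUFFIX = "\n\n###\n\n"
COMPLETION_STOP = "\n\nEND\n\n"


def make_training_pair(passage, records):
    """Build one prompt-completion pair whose completion is a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_SUFFIX,
        "completion": " " + json.dumps(records) + COMPLETION_STOP,
    }


def parse_completion(text):
    """Return the list of records in a completion, or None if it does not parse."""
    # Keep only the text before the stop string (all of it if the stop is absent).
    body = text.partition(COMPLETION_STOP)[0].strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


# Illustrative passage and annotation (assumed schema, not the paper's).
passage = "ZnO films were doped with Al to improve conductivity."
records = [{"host": "ZnO", "dopants": ["Al"]}]

pair = make_training_pair(passage, records)
jsonl_line = json.dumps(pair)  # one line of a JSONL fine-tuning file
roundtrip = parse_completion(pair["completion"])
```

Returning `None` for unparsable completions keeps parse failures separate from extraction errors, so they can be counted on their own when evaluating a fine-tuned model.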
