Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets

Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton

TL;DR

This study tests GPT-4's ability to perform ad-hoc, schema-based information extraction from scientific literature by attempting to reproduce two manually curated materials datasets (MPEA and diffusion in silicate melts) without fine-tuning. It compares zero-shot, one-shot, and LangChain prompting, coupled with minimal postprocessing, and introduces a row-alignment scheme to evaluate matches, misses, and hallucinations. An expert-driven error analysis reveals that most failures arise from figure-bound data, PDF/XML parsing limitations, non-standard table formats, and unit-conversion issues, with narrative data occasionally aiding extraction. The findings identify concrete research directions (native PDF support, multimodal data access, robust table comprehension, explicit narrative-to-table relationships, and deeper, discipline-specific schemas) that are essential for advancing reliable ad-hoc scientific information extraction in materials science and beyond.
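
To make the row-alignment idea concrete, here is a minimal sketch of one way extracted rows could be scored against a curated dataset. It is an illustration under stated assumptions, not the paper's actual scheme: the function name `align_rows`, the greedy exact-match strategy, and the field names `formula` and `property` are all hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def align_rows(gold: List[Row], predicted: List[Row],
               key_fields: List[str]) -> Tuple[int, int, int]:
    """Greedy one-to-one alignment on normalized key fields.

    Returns (matches, misses, hallucinations). This is an assumed,
    simplified stand-in for the paper's alignment scheme.
    """
    def key(row: Row) -> Tuple[str, ...]:
        # Normalize whitespace and case so trivial formatting
        # differences do not count as errors.
        return tuple(str(row.get(f, "")).strip().lower() for f in key_fields)

    unmatched = [key(r) for r in predicted]
    matches = 0
    for g in gold:
        k = key(g)
        if k in unmatched:
            unmatched.remove(k)  # each predicted row can match only once
            matches += 1

    misses = len(gold) - matches      # gold rows the model did not produce
    hallucinations = len(unmatched)   # predicted rows with no gold counterpart
    return matches, misses, hallucinations


# Hypothetical example rows; the values are illustrative, not from the datasets.
gold_rows = [
    {"formula": "CoCrFeNi", "property": "hardness"},
    {"formula": "AlCoCrFeNi", "property": "yield strength"},
]
predicted_rows = [
    {"formula": "cocrfeni ", "property": "Hardness"},
    {"formula": "TiZrNb", "property": "hardness"},
]
result = align_rows(gold_rows, predicted_rows, ["formula", "property"])
```

In this toy case one gold row is matched, one is missed, and one predicted row has no counterpart. A realistic scheme would likely also need tolerance for numeric values and unit conversions, which the error analysis flags as a failure source.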

Abstract

We explore the ability of GPT-4 to perform ad-hoc, schema-based information extraction from scientific literature. Specifically, we assess whether it can, with a basic prompting approach, replicate two existing materials science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis to assess where the model struggles to faithfully extract the desired information, and draw on their insights to suggest research directions to address this broadly important task.

Paper Structure (46 sections, 14 figures, 4 tables)

This paper contains 46 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Extracting large-scale structured data from scientific literature should be as simple as specifying a schema, a corpus of manuscripts, and a few exemplars, and letting the LLM perform the extraction.
  • Figure 2: Abbreviated one-shot prompt. The prompt begins with role-setting, includes a single exemplar with prompt instructions, then repeated prompt instructions with additional context and clarifications.
  • Figure 3: Counts of rows with key information in different formats.
  • Figure 4: Proportions of differing error reasons, divided by error type (missed row vs. hallucination) and dataset.
  • Figure 5: Plot visualizing the variance of properties found in the two datasets. High variance indicates properties that are altered across experiments; low variance indicates experimental constants.
  • ...and 9 more figures