Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets
Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton
TL;DR
This study tests GPT-4's ability to perform ad-hoc, schema-based information extraction from scientific literature by attempting to reproduce two manually curated materials datasets (multi-principal element alloys, MPEA, and diffusion in silicate melts) without fine-tuning. It compares zero-shot, one-shot, and LangChain prompting, coupled with minimal postprocessing, and introduces a row-alignment scheme to classify extracted rows as matches, misses, or hallucinations. An expert-driven error analysis reveals that most failures arise from figure-bound data, PDF/XML parsing limitations, non-standard table formats, and unit-conversion issues, with narrative data occasionally aiding extraction. The findings identify concrete research directions (native PDF support, multimodal data access, robust table comprehension, explicit narrative-to-table relationships, and deeper, discipline-specific schemas) that are essential for advancing reliable ad-hoc scientific information extraction in materials science and beyond.
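To make the idea of row alignment concrete, here is a minimal sketch of one way extracted rows could be paired with gold rows and then sorted into matches, misses, and hallucinations. It is not the authors' scheme, which this summary does not specify. The function names, the field-agreement similarity, the greedy one-to-one matching, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]  # one dataset row: schema field name -> value


def row_similarity(a: Row, b: Row) -> float:
    """Fraction of shared fields whose normalized values agree (assumed metric)."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    same = sum(
        1 for k in keys
        if str(a[k]).strip().lower() == str(b[k]).strip().lower()
    )
    return same / len(keys)


def align_rows(
    gold: List[Row], pred: List[Row], threshold: float = 0.5
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedy one-to-one alignment of predicted rows to gold rows.

    Returns (matches, misses, hallucinations), where matches are
    (gold_index, pred_index) pairs, misses are unmatched gold indices,
    and hallucinations are unmatched predicted indices.
    """
    # Score every gold/pred pair and visit the most similar pairs first.
    pairs = sorted(
        (
            (row_similarity(g, p), gi, pi)
            for gi, g in enumerate(gold)
            for pi, p in enumerate(pred)
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_gold, used_pred, matches = set(), set(), []
    for score, gi, pi in pairs:
        if score < threshold:
            break  # remaining pairs are too dissimilar to count as matches
        if gi in used_gold or pi in used_pred:
            continue  # each row may be matched at most once
        used_gold.add(gi)
        used_pred.add(pi)
        matches.append((gi, pi))
    misses = [gi for gi in range(len(gold)) if gi not in used_gold]
    hallucinations = [pi for pi in range(len(pred)) if pi not in used_pred]
    return matches, misses, hallucinations
```

Under these assumptions, a gold row with no sufficiently similar extracted row counts as a miss, and an extracted row with no gold counterpart counts as a hallucination. A real evaluation would likely need field-specific comparison, such as numeric tolerance and unit normalization, rather than exact string agreement.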
Abstract
We explore the ability of GPT-4 to perform ad-hoc, schema-based information extraction from scientific literature. Specifically, we assess whether it can, with a basic prompting approach, replicate two existing materials science datasets, given the manuscripts from which they were originally manually extracted. We employ materials scientists to perform a detailed manual error analysis to identify where the model struggles to faithfully extract the desired information, and we draw on their insights to suggest research directions for addressing this broadly important task.
