Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction

Chinedu Ekuma

TL;DR

PropertyExtractor tackles the challenge of trustworthy data extraction from unstructured scholarly text by integrating zero-shot and few-shot in-context learning within conversational LLMs such as Google Gemini Pro and OpenAI GPT-4. The toolkit employs engineered prompts, dynamic prompt updates, regex-assisted extraction, and self-critique to produce structured material-property quadruples and to verify data accuracy. On thickness data for 2D materials and energy-bandgap data, it achieves high metrics (thickness: $P=95.74\%$, $R=93.75\%$, $F_1=94.73\%$, $Acc=90.00\%$, $E_r=10.00\%$; bandgap: $P=96.81\%$, $R=94.72\%$, $F_1=95.21\%$, $Acc=92.05\%$, $E_r=7.95\%$). The open-source design emphasizes adaptability to future LLMs and supports automated generation of property databases and downstream tasks such as knowledge graphs.

Abstract

The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. In this paper, we introduce PropertyExtractor, an open-source tool that leverages advanced conversational LLMs like Google gemini-pro and OpenAI gpt-4, blends zero-shot with few-shot in-context learning, and employs engineered prompts for the dynamic refinement of structured information hierarchies, enabling autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data. Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%, highlighting the effectiveness and versatility of the toolkit. Finally, databases for 2D material thicknesses, a critical parameter for device integration, and energy bandgap values are developed using PropertyExtractor. Specifically for the thickness database, the rapid evolution of the field has outpaced both experimental measurements and computational methods, creating a significant data gap. Our work addresses this gap and showcases the potential of PropertyExtractor as a reliable and efficient tool for the autonomous generation of various material property databases, advancing the field.

Paper Structure (8 sections, 1 equation, 4 figures, 1 table)

Figures (4)

  • Figure 1: Summarized flowchart of the PropertyExtractor code for obtaining a structured dataset with a conversational large language model. The flow diagram provides the basic ideas for each step of the process and illustrates the integration with the API for obtaining unstructured scientific papers. A more detailed flowchart is presented in Figure 2.
  • Figure 2: Flowchart of the PropertyExtractor code for obtaining a structured dataset with a conversational large language model. The flow diagram is made up of several interrelated prompt sections guiding the framework for the successful and accurate extraction of materials properties in the form of the quadruplets [material, property value, original unit, method].
  • Figure 3: Material property database with PropertyExtractor. A snapshot of the extracted thickness data for atomically thin 2D materials, illustrating the range and diversity of the autonomously obtained database.
  • Figure 4: Performance Evaluation of PropertyExtractor. Confusion matrix comparing the ground truth with data extracted by PropertyExtractor, showcasing 45 true positives, 2 false positives, and 3 false negatives, which are used to calculate the model's precision, recall, accuracy, and error metrics.
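
The counts in the Figure 4 caption are enough to reproduce the reported thickness metrics. The sketch below is a minimal illustration, not code from PropertyExtractor: the names `Counts` and `confusion_metrics` are hypothetical, and it assumes the number of true negatives is zero, an assumption chosen because it makes the accuracy come out to 45/50 = 90.00%, matching the reported value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts (hypothetical container, not part of PropertyExtractor)."""
    tp: int      # true positives
    fp: int      # false positives
    fn: int      # false negatives
    tn: int = 0  # true negatives; assumed zero here (see lead-in)


def confusion_metrics(c: Counts) -> dict[str, float]:
    """Return standard classification metrics as fractions in [0, 1]."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (c.tp + c.tn) / total
    error_rate = (c.fp + c.fn) / total
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }


if __name__ == "__main__":
    # Counts taken from the Figure 4 caption.
    for name, value in confusion_metrics(Counts(tp=45, fp=2, fn=3)).items():
        print(f"{name}: {100 * value:.2f}%")
```

With these counts, precision is 45/47 (about 95.74%), recall is 45/48 (93.75%), accuracy is 90.00%, and the error rate is 10.00%. The F1 score is 90/95, about 94.74%, which agrees with the reported 94.73% up to rounding in the last digit.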