Automated Extraction of Material Properties using LLM-based AI Agents

Subham Ghosh, Abhishek Tewari

TL;DR

This work addresses the bottleneck of machine-readable materials data locked in unstructured literature by introducing an agentic LLM-based workflow to extract thermoelectric and structural properties from roughly 10,000 full-text articles. Using a LangGraph-based four-agent pipeline and zero-shot prompting, the approach yields 27,822 temperature-resolved records with normalized units, accompanied by a web-based explorer for querying and export. Benchmarking across GPT-4.1, GPT-4.1 Mini, and Gemini models demonstrates a cost–quality gradient, with GPT-4.1 delivering the highest accuracy (TE F1 ~0.91, structural F1 ~0.82) and GPT-4.1 Mini offering near-parity at a fraction of the cost. The dataset and open explorer establish a scalable, reproducible foundation for structure–property analyses and are readily generalizable to other materials domains such as batteries, catalysts, and magnets.

Abstract

The rapid discovery of materials is constrained by the lack of large, machine-readable datasets that couple performance metrics with structural context. Existing databases are small, manually curated, or biased toward first-principles results, leaving experimental literature underexploited. We present an agentic, large language model (LLM)-driven workflow that autonomously extracts thermoelectric and structural properties from about 10,000 full-text scientific articles. The pipeline integrates dynamic token allocation, zero-shot multi-agent extraction, and conditional table parsing to balance accuracy against computational cost. Benchmarking on 50 curated papers shows that GPT-4.1 achieves the highest accuracy (F1 = 0.91 for thermoelectric properties and 0.82 for structural fields), while GPT-4.1 Mini delivers nearly comparable performance (F1 = 0.89 and 0.81) at a fraction of the cost, enabling practical large-scale deployment. Applying this workflow, we curated 27,822 temperature-resolved property records with normalized units, spanning figure of merit (ZT), Seebeck coefficient, conductivity, resistivity, power factor, and thermal conductivity, together with structural attributes such as crystal class, space group, and doping strategy. Dataset analysis reproduces known thermoelectric trends, such as the superior performance of alloys over oxides and the advantage of p-type doping, while also surfacing broader structure-property correlations. To facilitate community access, we release an interactive web explorer with semantic filters, numeric queries, and CSV export. This study delivers the largest LLM-curated thermoelectric dataset to date, provides a reproducible and cost-profiled extraction pipeline, and establishes a foundation for scalable, data-driven materials discovery beyond thermoelectrics.
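The control flow named in the abstract (dynamic token allocation, zero-shot extraction, conditional table parsing) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the LangGraph API. The function names, the `PaperState` container, the budget formula, and the floor and ceiling values are all hypothetical, and the extractors are passed in as plain callables standing in for LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An extractor takes (text, max_tokens) and returns a list of records.
# In a real system this would wrap an LLM call; here it is a plain callable.
Extractor = Callable[[str, int], List[Dict]]


def allocate_tokens(n_input_tokens: int, floor: int = 512, ceiling: int = 4096) -> int:
    """Scale the output-token budget with input length, clamped to [floor, ceiling].

    The divisor and the bounds are illustrative assumptions, not values from the paper.
    """
    return max(floor, min(ceiling, n_input_tokens // 4))


@dataclass
class PaperState:
    """Hypothetical per-article state passed through the pipeline."""
    text: str
    tables: List[str] = field(default_factory=list)
    max_tokens: int = 0
    records: List[Dict] = field(default_factory=list)


def run_pipeline(
    state: PaperState,
    extract_te: Extractor,
    extract_struct: Extractor,
    parse_table: Extractor,
) -> PaperState:
    # Step 1: dynamic token allocation. A whitespace split is a crude
    # stand-in for a real tokenizer.
    state.max_tokens = allocate_tokens(len(state.text.split()))

    # Step 2: extraction of thermoelectric and structural fields from the text.
    state.records += extract_te(state.text, state.max_tokens)
    state.records += extract_struct(state.text, state.max_tokens)

    # Step 3: conditional table parsing. The table extractor only runs
    # when the article actually contains tables, which saves model calls.
    for table in state.tables:
        state.records += parse_table(table, state.max_tokens)

    return state


if __name__ == "__main__":
    # Stub extractors so the sketch runs without any model access.
    def te(text: str, max_tokens: int) -> List[Dict]:
        return [{"property": "ZT", "source": "text"}]

    def struct(text: str, max_tokens: int) -> List[Dict]:
        return [{"property": "space_group", "source": "text"}]

    def table(text: str, max_tokens: int) -> List[Dict]:
        return [{"property": "seebeck_coefficient", "source": "table"}]

    paper = PaperState(text="word " * 4000, tables=["| T (K) | ZT |"])
    result = run_pipeline(paper, te, struct, table)
    print(result.max_tokens, len(result.records))
```

Keeping the table step conditional is the part that trades accuracy against cost: articles without tables skip that call entirely, while longer articles receive a larger output budget up to the ceiling.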

Paper Structure

This paper contains 16 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Automated workflow for article retrieval and preprocessing, featuring document fetching, noise removal, tokenization, and metadata enrichment to generate LLM-ready datasets.
  • Figure 2: Agentic LangGraph [langgraph2024] workflow for extracting thermoelectric and structural properties using LLMs. The system dynamically allocates tokens, performs zero-shot extraction, and conditionally processes tabular data before saving structured outputs.
  • Figure 3: Token pricing comparison for GPT [openai_pricing] and Gemini [google_gemini_pricing] models (input/output cost per 1M tokens).
  • Figure 4: Coverage percentage of extracted thermoelectric and structural properties across the curated dataset.
  • Figure 5: Counts of thermoelectric property records with and without corresponding temperature information.
  • ...and 7 more figures