Automated Extraction of Material Properties using LLM-based AI Agents
Subham Ghosh, Abhishek Tewari
TL;DR
This work addresses the scarcity of machine-readable materials data, much of which remains locked in unstructured literature, by introducing an agentic LLM-based workflow that extracts thermoelectric and structural properties from roughly 10,000 full-text articles. Using a LangGraph-based four-agent pipeline and zero-shot prompting, the approach yields 27,822 temperature-resolved records with normalized units, accompanied by a web-based explorer for querying and export. Benchmarking across GPT-4.1, GPT-4.1 Mini, and Gemini models reveals a cost–quality gradient: GPT-4.1 delivers the highest accuracy (thermoelectric F1 ~0.91, structural F1 ~0.82), while GPT-4.1 Mini offers near-parity at a fraction of the cost. The dataset and open explorer provide a scalable, reproducible foundation for structure–property analyses, and the workflow should generalize to other materials domains such as batteries, catalysts, and magnets.
Abstract
The rapid discovery of materials is constrained by the lack of large, machine-readable datasets that couple performance metrics with structural context. Existing databases are small, manually curated, or biased toward first-principles results, leaving the experimental literature underexploited. We present an agentic, large language model (LLM)-driven workflow that autonomously extracts thermoelectric and structural properties from about 10,000 full-text scientific articles. The pipeline integrates dynamic token allocation, zero-shot multi-agent extraction, and conditional table parsing to balance accuracy against computational cost. Benchmarking on 50 curated papers shows that GPT-4.1 achieves the highest accuracy (F1 = 0.91 for thermoelectric properties and 0.82 for structural fields), while GPT-4.1 Mini delivers nearly comparable performance (F1 = 0.89 and 0.81) at a fraction of the cost, enabling practical large-scale deployment. Applying this workflow, we curated 27,822 temperature-resolved property records with normalized units, spanning figure of merit (ZT), Seebeck coefficient, conductivity, resistivity, power factor, and thermal conductivity, together with structural attributes such as crystal class, space group, and doping strategy. Analysis of the dataset reproduces known thermoelectric trends, such as the superior performance of alloys over oxides and the advantage of p-type doping, while also surfacing broader structure-property correlations. To facilitate community access, we release an interactive web explorer with semantic filters, numeric queries, and CSV export. This study delivers the largest LLM-curated thermoelectric dataset to date, provides a reproducible and cost-profiled extraction pipeline, and establishes a foundation for scalable, data-driven materials discovery beyond thermoelectrics.
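
The F1 scores quoted above combine precision and recall over extracted fields. As a minimal illustration of how such a score can be computed for a single paper, the Python sketch below compares predicted field/value pairs against a hand-labelled reference. The exact-match criterion, the function name field_f1, and the example field names and values are assumptions made for illustration only; the abstract does not specify the matching rule used in the benchmark.

```python
from typing import Dict, Tuple


def field_f1(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match field/value pairs.

    Assumption: a prediction counts as correct only if both the field name
    and its value match the reference exactly. A real benchmark may use a
    more tolerant rule (e.g. numeric tolerance or unit-aware comparison).
    """
    pred_items = set(predicted.items())
    gold_items = set(gold.items())
    true_positives = len(pred_items & gold_items)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(pred_items) if pred_items else 0.0
    recall = true_positives / len(gold_items) if gold_items else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference and model output for one paper.
gold = {"space_group": "Fm-3m", "crystal_class": "cubic", "doping": "p-type"}
predicted = {
    "space_group": "Fm-3m",
    "crystal_class": "cubic",
    "doping": "n-type",   # wrong value, counted as a miss
    "zt": "1.2",          # not in the reference, counted as a false positive
}

precision, recall, f1 = field_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case two of four predicted pairs are correct and two of three reference pairs are recovered, so precision is 0.50, recall is about 0.67, and F1 is about 0.57. Averaging such per-paper scores over a curated benchmark set is one plausible way to arrive at aggregate figures like those reported above.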
