Large Language Model-Driven Database for Thermoelectric Materials
Suman Itani, Yibo Zhang, Jiadong Zang
TL;DR
This work tackles the data bottleneck in thermoelectric material discovery by building a large, literature-derived database with an LLM-driven pipeline (GPTArticleExtractor) that automatically extracts structured thermoelectric and structural data from Elsevier publications. It delivers 7,123 compounds with properties including Seebeck coefficient, electrical and thermal conductivity, power factor, and figure of merit $ZT$, along with crystal and lattice descriptors and measurement temperatures. The dataset distinguishes experimental from theoretical sources (about 66% experimental) and emphasizes data completeness and temperature dependence to support downstream ML and graph-based analyses. Openly accessible at nemad.org, the resource aims to accelerate data-driven discovery and optimization of thermoelectric materials for energy-efficient technologies.
Abstract
Thermoelectric materials provide a sustainable way to convert waste heat into electricity. However, data-driven discovery and optimization of these materials remain challenging because no reliable database is available. Here we developed a comprehensive database of 7,123 thermoelectric compounds, containing key information such as chemical composition, structural details, Seebeck coefficient, electrical and thermal conductivity, power factor, and figure of merit (ZT). We used the GPTArticleExtractor workflow, powered by large language models (LLMs), to extract and curate data automatically from the scientific literature published in Elsevier journals. This process produced a structured database that addresses the challenges of manual data collection. The open-access database could stimulate data-driven research and advance thermoelectric material analysis and discovery.
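
The properties listed in the abstract are linked by the standard definitions of the power factor, PF = S^2 * sigma, and the dimensionless figure of merit, ZT = S^2 * sigma * T / kappa, where S is the Seebeck coefficient, sigma the electrical conductivity, kappa the thermal conductivity, and T the absolute temperature. The sketch below shows how a single record could be checked for internal consistency under those definitions. The class name, field names, and numerical values are hypothetical illustrations; they are not the database's schema and are not taken from the database.

```python
from dataclasses import dataclass


@dataclass
class ThermoRecord:
    """Hypothetical record layout; not the actual database schema."""
    seebeck: float               # Seebeck coefficient S, in V/K
    electrical_cond: float       # electrical conductivity sigma, in S/m
    thermal_cond: float          # thermal conductivity kappa, in W/(m K)
    temperature: float           # measurement temperature T, in K


def power_factor(rec: ThermoRecord) -> float:
    """Power factor PF = S^2 * sigma, in W/(m K^2)."""
    return rec.seebeck ** 2 * rec.electrical_cond


def figure_of_merit(rec: ThermoRecord) -> float:
    """Dimensionless figure of merit ZT = S^2 * sigma * T / kappa."""
    return power_factor(rec) * rec.temperature / rec.thermal_cond


# Illustrative values only (assumed, not drawn from the database).
example = ThermoRecord(
    seebeck=200e-6,
    electrical_cond=1.0e5,
    thermal_cond=1.5,
    temperature=300.0,
)
pf = power_factor(example)
zt = figure_of_merit(example)
print(f"PF = {pf:.3e} W/(m K^2), ZT = {zt:.2f}")
```

A check of this kind is one way to flag extracted records whose reported ZT disagrees with the value implied by the other reported properties at the same temperature, provided all quantities are first converted to consistent SI units.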
