Large Language Model-Driven Database for Thermoelectric Materials

Suman Itani, Yibo Zhang, Jiadong Zang

TL;DR

This work tackles the data bottleneck in thermoelectric material discovery by building a large, literature-derived database through an LLM-driven pipeline (GPTArticleExtractor) that automatically extracts structured thermoelectric and structural data from Elsevier publications. It delivers 7,123 compounds with properties including Seebeck coefficient, electrical and thermal conductivities, power factor, and figure of merit $ZT$, along with crystal and lattice descriptors and measurement temperatures. The dataset distinguishes experimental from theoretical sources (about 66% experimental) and emphasizes data completeness and temperature dependence to support downstream ML and graph-based analyses. Openly accessible at nemad.org, the resource aims to accelerate data-driven discovery and optimization of thermoelectric materials for energy-efficient technologies.

Abstract

Thermoelectric materials provide a sustainable way to convert waste heat into electricity. However, data-driven discovery and optimization of these materials are hindered by the lack of a reliable database. Here we developed a comprehensive database of 7,123 thermoelectric compounds, containing key information such as chemical composition, structural details, Seebeck coefficient, electrical and thermal conductivity, power factor, and figure of merit (ZT). We used the GPTArticleExtractor workflow, powered by large language models (LLMs), to extract and curate data automatically from the scientific literature published in Elsevier journals. This process enabled the creation of a structured database that addresses the challenges of manual data collection. The open-access database could stimulate data-driven research and advance thermoelectric material analysis and discovery.
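
The properties listed in the abstract are linked by the standard textbook definitions of the power factor, $PF = S^2 \sigma$, and the dimensionless figure of merit, $ZT = S^2 \sigma T / \kappa$, where $S$ is the Seebeck coefficient, $\sigma$ the electrical conductivity, $\kappa$ the thermal conductivity, and $T$ the absolute temperature. The sketch below shows how these quantities could be computed from one record, assuming SI units. The function names and the numerical inputs are illustrative assumptions and are not taken from the database.

```python
def power_factor(seebeck_v_per_k: float, sigma_s_per_m: float) -> float:
    """Power factor PF = S^2 * sigma, in W/(m*K^2) for SI inputs."""
    return seebeck_v_per_k ** 2 * sigma_s_per_m


def figure_of_merit(
    seebeck_v_per_k: float,
    sigma_s_per_m: float,
    kappa_w_per_m_k: float,
    temperature_k: float,
) -> float:
    """Dimensionless ZT = S^2 * sigma * T / kappa for SI inputs."""
    if kappa_w_per_m_k <= 0:
        raise ValueError("thermal conductivity must be positive")
    if temperature_k <= 0:
        raise ValueError("temperature must be a positive absolute temperature")
    return power_factor(seebeck_v_per_k, sigma_s_per_m) * temperature_k / kappa_w_per_m_k


if __name__ == "__main__":
    # Made-up inputs for illustration only (not a database entry):
    # S = 200 microvolts per kelvin, sigma = 1e5 S/m, kappa = 1.5 W/(m*K), T = 300 K.
    s, sigma, kappa, temp = 200e-6, 1e5, 1.5, 300.0
    print(f"PF = {power_factor(s, sigma):.3e} W/(m*K^2)")
    print(f"ZT = {figure_of_merit(s, sigma, kappa, temp):.2f}")
```

Because literature values are often reported in mixed units (for example, µV/K for the Seebeck coefficient or S/cm for conductivity), a check like this is only meaningful after converting every field to consistent units.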

Paper Structure

This paper contains 14 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Workflow for constructing a database of thermoelectric materials. Scientific articles, identified by their DOIs, are accessed via API requests and processed as XML files. Using text and table parsers, the XML files are converted into plain-text CSV format. The GPTArticleExtractor, leveraging the GPT-4 model, is then applied to extract relevant information from the plain text and organize it into structured JSON lists. These JSON files are further refined and combined to create the final database, containing detailed thermoelectric and structural properties of the materials.
  • Figure 2: Bar chart of the number of compounds reported for each property in the database.
  • Figure 3: Distribution of Seebeck Coefficient in Database
  • Figure 4: Distribution of Electrical Conductivity in Database
  • Figure 5: Distribution of Thermal Conductivity in Database
  • ...and 2 more figures
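
The final step in the Figure 1 caption, combining per-article JSON lists into one database, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names (`composition`, `temperature_K`, `zt`), the placeholder DOIs, and the choice to drop exact duplicate records are all hypothetical.

```python
import json


def merge_records(json_by_doi: dict[str, str]) -> list[dict]:
    """Combine per-article JSON lists into one flat list of records.

    json_by_doi maps a DOI to a JSON string holding a list of record objects.
    Each output record is tagged with its source DOI. Records from the same
    article that share composition and temperature are kept only once
    (an assumed refinement rule, chosen here for illustration).
    """
    merged: dict[tuple, dict] = {}
    for doi, payload in json_by_doi.items():
        for record in json.loads(payload):
            key = (doi, record.get("composition"), record.get("temperature_K"))
            # setdefault keeps the first record seen for each key.
            merged.setdefault(key, {"doi": doi, **record})
    return list(merged.values())


if __name__ == "__main__":
    # Placeholder DOIs and made-up values, for illustration only.
    payloads = {
        "10.0000/example.a": json.dumps([
            {"composition": "Bi2Te3", "temperature_K": 300, "zt": 0.8},
            {"composition": "Bi2Te3", "temperature_K": 300, "zt": 0.8},
        ]),
        "10.0000/example.b": json.dumps([
            {"composition": "PbTe", "temperature_K": 700},
        ]),
    }
    for row in merge_records(payloads):
        print(row)
```

Keying on the DOI together with composition and temperature keeps measurements of the same compound from different articles, or at different temperatures, as separate rows, which matters for a dataset that emphasizes temperature dependence.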