Sampling Latent Material-Property Information From LLM-Derived Embedding Representations

Luke P. J. Gilligan, Matteo Cobelli, Hasan M. Sayeed, Taylor D. Sparks, Stefano Sanvito

TL;DR

Latent material-property information can be extracted from LLM embeddings without task-specific training, but only with carefully engineered prompts and contextual cues. The authors compare two embedding strategies using a 13B-parameter Llama 2 model: direct embeddings of the chemical formula, and composition-weighted elemental embeddings with contextualization terms. They quantify ranking quality with the Spearman rank correlation, $\rho$, between cosine-similarity rankings and ground-truth properties such as Curie temperature, thermoelectric power factor, and band gap. Direct formula embeddings perform poorly, whereas context-driven elemental embeddings reach moderate correlations for magnetism ($\rho > 0.5$) but remain inconsistent for other properties, indicating limited out-of-the-box utility. The results suggest that contextual prompts and domain knowledge can unlock useful material representations, yet robust, generalizable performance requires more systematic context engineering and possibly domain-specific data.
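
The evaluation step summarized above (rank materials by cosine similarity to a query key, then compare that ranking with the ground truth through the Spearman coefficient) can be illustrated with a minimal sketch. The function names, the two-dimensional toy vectors, and the property values are all hypothetical and chosen only for illustration; they are not the authors' code or data, and real LLM embeddings have thousands of dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def average_ranks(values):
    """Rank values from 1 to n; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def rank_correlation(material_vectors, query_vector, ground_truth):
    """Spearman rho between similarity-to-query and a measured property."""
    sims = [cosine_similarity(v, query_vector) for v in material_vectors]
    return spearman_rho(sims, ground_truth)


# Toy data (assumption): four "materials" in a 2-D embedding space,
# a query key pointing along the first axis, and made-up property values.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
toy_property = [900.0, 600.0, 300.0, 100.0]

rho = rank_correlation(toy_vectors, toy_query, toy_property)
print(f"Spearman rho = {rho:.3f}")
```

Because only the ranks enter the coefficient, the result is insensitive to any monotonic rescaling of either the similarities or the property values, which is why a rank correlation suits a comparison between a unitless similarity and a physical quantity.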

Abstract

Vector embeddings derived from large language models (LLMs) show promise in capturing latent information from the literature. Interestingly, these can be integrated into material embeddings, potentially useful for data-driven predictions of materials properties. We investigate the extent to which LLM-derived vectors capture the desired information and their potential to provide insights into material properties without additional training. Our findings indicate that, although LLMs can be used to generate representations reflecting certain property information, extracting the embeddings requires identifying the optimal contextual clues and appropriate comparators. Despite this restriction, it appears that LLMs still have the potential to be useful in generating meaningful materials-science representations.

Paper Structure (4 sections, 3 equations, 7 figures)

Figures (7)

  • Figure 1: Parity plots comparing the World Bank 2022 GDP ranking ('GDP ranking') with the country cosine similarity against the string 'gross domestic product'. Here the embeddings are derived from the final-layer LLM representation: (a) without any context and (b) by providing the contextual phrase 'economy of' before the country name. Both rankings are compiled using the largest Llama 2 model (13B parameters). The colours encode the number of countries presenting that particular ranking.
  • Figure 2: Parity plots comparing the ground-truth Curie-temperature ranking with the ranking based on the cosine similarity of the embedding vectors with different magnetic keywords (reported above each graph). In this case, the chemical formulae are directly embedded by the LLM. All rankings are compiled using the largest Llama 2 model (13B parameters). The Spearman rank correlation of each plot is reported in the legends. The colours encode the number of compounds presenting that particular $T_\mathrm{C}$ ranking.
  • Figure 3: Parity plots comparing the ground-truth $T_\mathrm{C}$ ranking with the ranking based on the cosine similarity of the embedding vectors with different magnetic keywords (reported above each graph). In this case, each compound is embedded through the composition-averaged elemental embedding, having 'ferromagnet' as a contextualization term. All rankings are compiled using the largest Llama 2 model (13B parameters). The Spearman rank correlation of each plot is reported in the legends. The colours encode the number of compounds presenting that particular $T_\mathrm{C}$ ranking.
  • Figure 4: A heat map of the Spearman rank correlation coefficient, $\rho$, for different choices of contextualization terms and query keys. This is computed against the ground truth Curie-temperature database. The first row corresponds to composition-averaged elemental embedding in which no contextualization term was introduced, while in the first column, the query key is an empty string. The systematically best-performing query key is 'iron'.
  • Figure 5: Parity plots comparing the ground-truth thermoelectric power factor with the ranking based on the cosine similarity of the embedding vectors with different thermoelectric-related query keys (reported above each graph). In this case, each compound is embedded through the composition-averaged elemental embedding, having 'thermoelectric' as the contextualization key. All rankings are compiled using the largest Llama 2 model (13B parameters). The Spearman rank correlation of each plot is reported in the legends. The colours encode the number of compounds presenting that particular power-factor ranking.
  • ...and 2 more figures
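
The composition-averaged elemental embedding mentioned in the captions of Figures 3 to 5 can be sketched as a weighted sum, $\mathbf{v} = \sum_i x_i \mathbf{e}_i$, where $\mathbf{e}_i$ is the embedding of element $i$ and $x_i$ its weight. The sketch below rests on several assumptions that the captions do not confirm: the weights are taken to be atomic fractions, the contextualization term is simply prepended to the element symbol, and `toy_embed` is a hypothetical lookup table standing in for the LLM. The formula parser handles only simple formulae without parentheses.

```python
import re

import numpy as np


def parse_formula(formula):
    """Parse a simple formula such as 'Fe3O4' into {'Fe': 3.0, 'O': 4.0}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_averaged_embedding(formula, embed_text, context=""):
    """Weight each elemental embedding by its atomic fraction and sum.

    embed_text is any callable mapping a string to a vector (hypothetical
    interface). Prepending the context to the symbol is an assumption.
    """
    counts = parse_formula(formula)
    total = sum(counts.values())
    vector = None
    for symbol, amount in counts.items():
        prompt = f"{context} {symbol}".strip()
        contribution = (amount / total) * np.asarray(embed_text(prompt), dtype=float)
        vector = contribution if vector is None else vector + contribution
    return vector


def toy_embed(text):
    """Stand-in for an LLM embedding call; NOT a real model."""
    table = {
        "ferromagnet Fe": [1.0, 0.0],
        "ferromagnet O": [0.0, 1.0],
    }
    return table[text]


magnetite = composition_averaged_embedding("Fe3O4", toy_embed, context="ferromagnet")
print(magnetite)
```

In a real setting, `toy_embed` would be replaced by a call that returns the model's representation of the prompt, and the resulting compound vector would then be compared with the embedding of a query key by cosine similarity.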