Sampling Latent Material-Property Information From LLM-Derived Embedding Representations
Luke P. J. Gilligan, Matteo Cobelli, Hasan M. Sayeed, Taylor D. Sparks, Stefano Sanvito
TL;DR
Latent material-property information can be extracted from LLM embeddings without task-specific training, but only with carefully engineered prompts and contextual cues. The authors compare two embedding strategies using a 13B Llama 2 model: direct formula embeddings and composition-weighted elemental embeddings with contextualization terms. They quantify ranking quality with Spearman correlation against ground-truth properties such as Curie temperature, thermoelectric power factor, and band gap. Direct embeddings underperform, while context-driven elemental embeddings can achieve moderate correlations for magnetism ($\rho > 0.5$) but remain inconsistent for other properties, indicating potential but limited out-of-the-box utility. The results imply that contextual prompts and domain knowledge can unlock useful material representations, yet robust, generalizable performance requires more systematic context-engineering and possibly domain-specific data.
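The pipeline summarized above (composition-weighted elemental embeddings, scored against a contextualization term and evaluated by Spearman correlation) can be illustrated with a minimal sketch. It is not the authors' implementation. The element labels `A`, `B`, `C`, the random vectors standing in for LLM-derived embeddings, the `context_vector`, the use of cosine similarity as the comparator, and the `placeholder_targets` values are all assumptions made for illustration; none of them are data or results from the paper.

```python
import math
import random


def composition_weighted_embedding(composition, element_vectors):
    """Average the elemental vectors, weighted by each element's atomic fraction."""
    total = sum(composition.values())
    dim = len(next(iter(element_vectors.values())))
    vec = [0.0] * dim
    for element, amount in composition.items():
        weight = amount / total
        for i, x in enumerate(element_vectors[element]):
            vec[i] += weight * x
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed comparator)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical stand-ins: random vectors replace real LLM embeddings.
random.seed(0)
dim = 8
element_vectors = {el: [random.gauss(0, 1) for _ in range(dim)] for el in "ABC"}
context_vector = [random.gauss(0, 1) for _ in range(dim)]  # e.g. a property-related term

# Hypothetical compositions (element -> stoichiometric amount).
materials = {
    "A2B": {"A": 2, "B": 1},
    "AB": {"A": 1, "B": 1},
    "AC3": {"A": 1, "C": 3},
    "B2C": {"B": 2, "C": 1},
}

# Score each material by similarity of its embedding to the context vector.
scores = [
    cosine_similarity(composition_weighted_embedding(comp, element_vectors), context_vector)
    for comp in materials.values()
]

# Placeholder property values, NOT measured data; only their ordering matters.
placeholder_targets = [1.0, 2.0, 3.0, 4.0]
rho = spearman_rho(scores, placeholder_targets)
print(f"Spearman rho on toy data: {rho:.3f}")
```

Because Spearman correlation depends only on rank order, the sketch needs no fitted model: the similarity scores are used directly to rank the materials, which matches the training-free setting described above. With real embeddings, the random vectors would be replaced by vectors extracted from the language model.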
Abstract
Vector embeddings derived from large language models (LLMs) show promise in capturing latent information from the literature. These vectors can be integrated into material embeddings, which are potentially useful for data-driven prediction of material properties. We investigate the extent to which LLM-derived vectors capture the desired information and their potential to provide insights into material properties without additional training. Our findings indicate that, although LLMs can be used to generate representations reflecting certain property information, extracting that information from the embeddings requires identifying the optimal contextual clues and appropriate comparators. Despite this restriction, LLMs still appear to have the potential to generate meaningful materials-science representations.
