Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction
Ying-Ting Yeh, Janghoon Ock, Achuth Chandrasekhar, Shagun Maheshwari, Amir Barati Farimani
TL;DR
This work demonstrates that transformer-based language models can predict semiconductor band gaps directly from textual material descriptions, bypassing traditional feature engineering and graph-based structure encoding. Evaluating RoBERTa, T5, Llama-3, and MatSciBERT on an AFLOW-derived dataset with both structured and GPT-generated text, the study reports mean absolute errors (MAE) in the range of $0.25$–$0.33$ eV. Llama-3 achieves the best performance (MAE $=0.248$ eV, $R^2=0.891$), while MatSciBERT, which is pretrained on materials science literature, attains strong results (MAE $=0.288$ eV, $R^2=0.871$). Analyses of layer freezing, feature-wise attention, and embedding maps show that the models allocate more attention to composition and spin features than to geometry, and that finetuning reorients the latent space toward property prediction, which offers some insight into the learned relationships. The findings indicate that text-based material descriptions can serve as a flexible input modality for rapid, end-to-end property estimation, enabling scalable screening when detailed structural data may be unavailable.
Abstract
We investigate transformer-based language models, including RoBERTa, T5, Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor materials directly from textual descriptions. The inputs encode key material features, such as chemical composition, crystal system, space group, and other structural and electronic properties. Unlike shallow machine learning models, which require extensive feature engineering, or Graph Neural Networks, which rely on graph representations derived from atomic coordinates, pretrained language models can process textual inputs directly, eliminating the need for manual feature preprocessing or structure-based encoding. Material descriptions were constructed in two formats: structured strings with a consistent template and natural language narratives generated via the ChatGPT API. Each model was augmented with a custom regression head and finetuned for the band gap prediction task. Language models of different architectures and parameter sizes were all able to predict band gaps from human-readable text with strong accuracy, achieving MAEs in the range of 0.25–0.33 eV, which highlights the success of this approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion parameters, achieved the highest accuracy (MAE 0.248 eV, $R^2$ 0.891). MatSciBERT, pretrained on materials science literature, reached comparable performance (MAE 0.288 eV, $R^2$ 0.871) with significantly fewer parameters (110 million), emphasizing the importance of domain-specific pretraining. Attention analysis shows that both models selectively focus on compositional and spin-related features while de-emphasizing geometric features, reflecting the difficulty of capturing spatial information from text. These results establish that pretrained language models can effectively extract complex feature-property relationships from textual material descriptions.
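To make the structured-string input format and the two reported metrics concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `describe_material`, the `key: value` template, the separator, and the field names are all hypothetical, and the numeric values are toy numbers rather than results from the paper.

```python
from typing import Mapping, Sequence


def describe_material(features: Mapping[str, object]) -> str:
    """Render material features as a single templated string.

    Hypothetical template: the "key: value" layout and the "; " separator
    are assumptions for illustration, not the template used in the paper.
    Fields appear in the insertion order of the mapping.
    """
    return "; ".join(f"{key}: {value}" for key, value in features.items())


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative input only; field names are assumed, not taken from the dataset.
text = describe_material(
    {"composition": "GaAs", "crystal system": "cubic", "space group": 216}
)

# Toy band gaps in eV, chosen only to exercise the metric functions.
toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.5, 2.0, 2.5]
toy_mae = mean_absolute_error(toy_true, toy_pred)
toy_r2 = r2_score(toy_true, toy_pred)
```

In a setup like the one described above, a string such as `text` would be tokenized and passed to the language model, whose regression head outputs a single band gap value; the MAE and $R^2$ would then be computed over a held-out test set.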
