Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction

Ying-Ting Yeh; Janghoon Ock; Achuth Chandrasekhar; Shagun Maheshwari; Amir Barati Farimani

TL;DR

This work demonstrates that transformer-based language models can predict semiconductor band gaps directly from textual material descriptions, bypassing traditional feature engineering and graph-based structure encoding. Evaluating RoBERTa, T5, Llama-3, and MatSciBERT on an AFLOW-derived dataset with both structured and GPT-generated text, the study reports mean absolute errors (MAEs) of $0.25$–$0.33$ eV. Llama-3 performs best (MAE $= 0.248$ eV, $R^2 = 0.891$), and MatSciBERT, which is pretrained on materials science literature, follows closely (MAE $= 0.288$ eV, $R^2 = 0.871$). Analyses of layer freezing, feature-wise attention, and embedding maps show that the models attend more to composition and spin features than to geometry, and that finetuning reorients the latent space toward property prediction, which makes the learned relationships easier to interpret. The findings indicate that text-based material descriptions can serve as a flexible input modality for rapid, end-to-end property estimation, enabling scalable screening when detailed structural data are unavailable.
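For readers less familiar with the two reported metrics, the following is a minimal sketch of how MAE and $R^2$ are conventionally computed for a regression task. The function names and the toy values in the usage note are illustrative assumptions, not taken from the paper's code or data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, in the same units as the targets (eV here)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot
```

As a sanity check on the definitions, a predictor that always outputs the mean of the targets gets $R^2 = 0$, and a perfect predictor gets MAE $= 0$ and $R^2 = 1$.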

Abstract

We investigate transformer-based language models, including RoBERTa, T5, Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor materials directly from textual descriptions. The inputs encode key material features, such as chemical composition, crystal system, space group, and other structural and electronic properties. Unlike shallow machine learning models, which require extensive feature engineering, or Graph Neural Networks, which rely on graph representations derived from atomic coordinates, pretrained language models can process textual inputs directly, eliminating the need for manual feature preprocessing or structure-based encoding. Material descriptions were constructed in two formats: structured strings with a consistent template and natural language narratives generated via the ChatGPT API. Each model was augmented with a custom regression head and finetuned for the band gap prediction task. Language models of different architectures and parameter sizes were all able to predict band gaps from human-readable text with strong accuracy, achieving MAEs in the range of 0.25–0.33 eV, highlighting the success of this approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion parameters, achieved the highest accuracy (MAE 0.248 eV, $R^2$ 0.891). MatSciBERT, pretrained on materials science literature, reached comparable performance (MAE 0.288 eV, $R^2$ 0.871) with significantly fewer parameters (110 million), emphasizing the importance of domain-specific pretraining. Attention analysis shows that both models selectively focus on compositional and spin-related features while de-emphasizing geometric features, reflecting the difficulty of capturing spatial information from text. These results establish that pretrained language models can effectively extract complex feature-property relationships from textual material descriptions.
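To make the "structured strings with a consistent template" idea concrete, here is a minimal sketch of one way such a string could be built from a feature dictionary. The field names, their order, the `key: value` layout, and the GaAs example are all assumptions made for illustration; the paper's actual template and feature list may differ.

```python
# Hypothetical field order; the paper's actual template and features may differ.
FIELD_ORDER = ("composition", "crystal_system", "space_group")


def to_structured_string(features):
    """Serialize a feature dict into a fixed-order, comma-separated string."""
    missing = [name for name in FIELD_ORDER if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Emit fields in FIELD_ORDER so every material uses the same template,
    # regardless of the key order in the input dict.
    return ", ".join(f"{name}: {features[name]}" for name in FIELD_ORDER)


# Illustrative record (zinc-blende GaAs); not drawn from the paper's dataset.
example = {"space_group": 216, "composition": "GaAs", "crystal_system": "cubic"}
print(to_structured_string(example))
# prints: composition: GaAs, crystal_system: cubic, space_group: 216
```

Fixing the field order keeps the token positions of each feature roughly aligned across materials, which is one plausible reason to prefer a template over free-form text when a consistent input layout matters.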

Paper Structure (17 sections, 4 equations, 6 figures, 3 tables)


Introduction
Methods
RoBERTa
T5
Llama-3
MatSciBERT
Shallow ML Models
Dataset
Text Data Format
Input Features
Results and Discussion
Framework
Model Performance
Layer Freezing Analysis
Feature-wise Self-Attention Score
...and 2 more sections

Figures (6)

Figure 1: Overview of the proposed band gap prediction framework. a The pipeline starts from the AFLOW dataset, followed by feature selection, dataset preparation, and LLM training for final band gap prediction. b The two input formats: a string-based representation using direct feature values and a description-based format generated by GPT-3.5 Turbo. c Visualization of the finetuning process. The input text undergoes tokenization and embedding through multiple model architectures (RoBERTa, T5, Llama-3, MatSciBERT), followed by a custom regression head for prediction. d The Transformer encoder and the multi-head attention mechanism with Query (Q), Key (K), and Value (V) operations.
Figure 2: Parity plots for band gap predictions across models: a SVR, b XGBoost, c Random Forest, d RoBERTa, e T5, f Llama-3, g MatSciBERT
Figure 3: Scaling behavior of finetuning strategies across transformer-based models. MAE is shown as a function of the number of trainable parameters. Colors indicate different model architectures: RoBERTa (yellow), T5 (green), Llama-3 (blue), and MatSciBERT (red). Marker shapes represent different freezing strategies, from fully finetuned (no freezing) to full layer freezing.
Figure 4: Feature-wise self-attention scores for Llama-3 and MatSciBERT. a Llama-3, first layer; b Llama-3, final layer (layer 16); c MatSciBERT, first layer; d MatSciBERT, final layer (layer 12).
Figure 5: t-SNE visualizations of embeddings colored by crystal system. a-d show results from the pretrained models: a RoBERTa, b T5, c Llama-3, d MatSciBERT. e-h show results from the corresponding finetuned models: e RoBERTa, f T5, g Llama-3, h MatSciBERT.
...and 1 more figures
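
Panel d of Figure 1 refers to attention with Query (Q), Key (K), and Value (V) operations. As a reminder of what a single attention head computes, the following is a minimal pure-Python sketch of standard scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. It is a textbook illustration, not the paper's implementation, and it omits masking, multiple heads, and learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for one attention head.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights
```

Each row of `weights` sums to 1. Per-token weights of this kind are the raw quantities that a feature-wise attention analysis such as the one in Figure 4 would aggregate, although the exact aggregation used in the paper is not described in this summary.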
