
AlloyBERT: Alloy Property Prediction with Large Language Models

Akshat Chaudhari, Chakradhar Guntuboina, Hongshuo Huang, Amir Barati Farimani

TL;DR

AlloyBERT introduces a RoBERTa-based transformer pipeline that predicts alloy properties from textual descriptions of composition and processing. The pipeline trains a custom BPE tokenizer on domain text, then performs masked language modeling before regression fine-tuning. Against shallow baselines on the MPEA and RAYS datasets, the model achieves state-of-the-art error metrics, with MSE as low as 0.00015 and 0.00611 and $R^2$ up to 0.99 and 0.83, respectively. The approach demonstrates the viability of text-based, interpretable descriptions for alloy property prediction, offering a practical alternative to computationally intensive methods such as density functional theory (DFT). The framework can accelerate alloy discovery by enabling rapid, data-driven property estimates from human-readable input.

Abstract

The pursuit of novel alloys tailored to specific requirements poses significant challenges for researchers in the field. This underscores the importance of developing predictive techniques for essential physical properties of alloys based on their chemical composition and processing parameters. This study introduces AlloyBERT, a transformer encoder-based model designed to predict alloy properties such as elastic modulus and yield strength from textual inputs. Built on the pre-trained RoBERTa encoder, AlloyBERT uses self-attention to establish meaningful relationships between words, enabling it to interpret human-readable input and predict target alloy properties. By combining a tokenizer trained on our textual data with a RoBERTa encoder pre-trained and fine-tuned for this specific task, we achieved a mean squared error (MSE) of 0.00015 on the Multi Principal Elemental Alloys (MPEA) dataset and 0.00611 on the Refractory Alloy Yield Strength (RAYS) dataset. This surpasses the performance of shallow models, which achieved best-case MSEs of 0.00025 and 0.0076 on the MPEA and RAYS datasets, respectively. Our results highlight the potential of language models in materials science and establish a foundational framework for text-based prediction of alloy properties that does not rely on complex underlying representations, calculations, or simulations.
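For readers who want to check figures like these on their own predictions, the two metrics quoted in the TL;DR and abstract (MSE and $R^2$) can be sketched in a few lines. This is a minimal illustration of the standard definitions only; the function names `mse` and `r2` and the toy numbers are assumptions, not taken from the paper, and the sketch says nothing about how the authors scaled their targets.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not data from the paper).
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(mse(actual, predicted), r2(actual, predicted))
```

A lower MSE and an $R^2$ closer to 1 both indicate predictions that track the actual values more closely.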

Paper Structure (9 sections, 1 equation, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Overview of AlloyBERT: (a) Known properties of an alloy are converted to an elaborate textual description containing additional information about the constituents, processing, and other physical properties. (b) Visualization of the fine-tuning process. The embedding of the special token ‘<s>’ is input to the regression head, which comprises a linear layer and an activation layer. (c) Illustration of the Transformer encoder and multi-head attention mechanism.
  • Figure 2: Visualization of self-attention scores from AlloyBERT. The left column shows attention scores from the initial hidden layer, and the right column shows attention scores from the final hidden layer. The top row shows a String 4 text input from the MPEA dataset, and the bottom row shows a String 4 input from the RAYS dataset.
  • Figure 3: Parity plots for AlloyBERT predictions. The x-axis shows the actual values of elastic modulus in the MPEA dataset and yield strength in the RAYS dataset; the y-axis shows the values predicted by AlloyBERT.
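The two steps described in the Figure 1 caption, turning known alloy properties into text (a) and feeding the ‘<s>’ embedding to a regression head (b), can be sketched roughly as follows. Everything here is a hypothetical illustration: the sentence template in `describe_alloy`, the choice of `tanh` as the activation, and the toy weights are assumptions, since the text above does not specify the authors' actual wording, activation function, or layer sizes.

```python
import math


def describe_alloy(composition, processing):
    """Build a one-sentence description from a composition dict.

    The template is hypothetical; it is not the paper's string format.
    """
    parts = [f"{frac:.2f} {element}" for element, frac in sorted(composition.items())]
    return f"Alloy with atomic fractions {', '.join(parts)}; processing: {processing}."


def regression_head(token_embeddings, weights, bias):
    """Map the first-token (<s>) embedding to a scalar prediction.

    A linear layer followed by an activation; tanh is an assumed choice.
    """
    s_embedding = token_embeddings[0]  # embedding of the <s> token
    z = sum(w * x for w, x in zip(weights, s_embedding)) + bias
    return math.tanh(z)


text = describe_alloy({"Ti": 0.5, "Nb": 0.5}, "cast")
print(text)

# Toy encoder output: two tokens with 2-dimensional embeddings (illustrative only).
embeddings = [[1.0, 0.0], [9.0, 9.0]]
print(regression_head(embeddings, weights=[0.5, 2.0], bias=0.0))
```

In the actual model the embeddings would come from the fine-tuned RoBERTa encoder rather than a hand-written list; only the first position is read by the head, so the other token vectors affect the prediction solely through self-attention inside the encoder.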