From Tokens to Materials: Leveraging Language Models for Scientific Discovery

Yuwei Wan, Tong Xie, Nan Wu, Wenjie Zhang, Chunyu Kit, Bram Hoex

TL;DR

This study demonstrates that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. It also identifies a crucial "tokenizer effect", which highlights the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts.

Abstract

Exploring the predictive capabilities of language models in materials science is of ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer effect," highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A diagram of how context-free and context-average embeddings are obtained for Ca2Co2O5. In both methods, material names are segmented into tokens and fed to the token embedding model (BERT). The vectors from a chosen layer of the model's output are taken as token embeddings, and the embedding of a material name is the average of the embeddings of its tokens. The two methods differ in that the context-average method accompanies each material name with several context sentences, and the final material representation is the average of the material embeddings obtained across those context sentences.
  • Figure 2: Results of thermoelectric material prediction using the context-free method. Correlation_1 is the Spearman correlation between the predicted ranking and the standard ranking of the 84 materials in our evaluation dataset, reflecting the performance of the pre-trained token embedding models, BERT and MatBERT, on the task of application-oriented material prediction. Correlation_2 is the Spearman correlation between the predicted ranking and the ranking of the materials by token length, suggesting that the tokenized length of a material name strongly influences the task outcome.
  • Figure 3: Results of thermoelectric material prediction using the context-average method. Because the pre-trained token embedding models output 13 layers of hidden states, the high-dimensional vectors of each layer are extracted and used as token embeddings separately. The layer from which token embeddings are extracted affects the task result, and a certain layer shows a significant advantage over the other layers.
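The pooling described in Figure 1 and the rank comparison described in Figure 2 can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes the hidden states have already been produced by some model and are stored as nested lists indexed as `hidden_states[layer][token]`. The function names, the `name_positions` argument (the token indices that belong to the material name), and the toy numbers are all hypothetical.

```python
from statistics import mean


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def name_embedding(hidden_states, layer, name_positions):
    """Average one layer's token vectors over the tokens of the material name.

    Assumption: hidden_states[layer][token] is that token's vector.
    """
    return mean_vector([hidden_states[layer][i] for i in name_positions])


def context_average_embedding(contexts, layer):
    """Average the name embedding over several context sentences.

    contexts: list of (hidden_states, name_positions) pairs, one per sentence.
    """
    return mean_vector([name_embedding(hs, layer, pos) for hs, pos in contexts])


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy hidden states (hypothetical): 2 layers, 3 tokens, 2 dimensions.
# Tokens 1 and 2 are assumed to be the pieces of the material name.
sentence_a = [[[0.0, 0.0]] * 3, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
sentence_b = [[[0.0, 0.0]] * 3, [[0.0, 0.0], [2.0, 1.0], [4.0, 3.0]]]

single = name_embedding(sentence_a, layer=1, name_positions=[1, 2])
averaged = context_average_embedding(
    [(sentence_a, [1, 2]), (sentence_b, [1, 2])], layer=1
)
```

In this sketch the context-free case corresponds to calling `name_embedding` on hidden states computed from the material name alone, while `context_average_embedding` repeats the same pooling for each context sentence and averages the results. `spearman` can then compare any predicted score list against a reference list, or against the token lengths of the names, as in Figure 2. How the predicted ranking itself is produced is not specified here and is left out of the sketch.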