Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties

Shokirbek Shermukhamedov; Dilorom Mamurjonova; Michael Probst

TL;DR

This work addresses predicting chemical properties from structural information (the structure-to-property challenge) by introducing elEmBERT, a transformer-based encoder that combines local attention on atomic pair distribution functions (PDFs) with a global aggregation, using element embeddings and sub-element tokens. The model operates on inputs derived from atomic PDFs with a cutoff $r_\text{cut}=10\,\text{Å}$ and scales the element vocabulary from $V_\text{size}=101$ to $V_\text{size}=565$ via sub-element decomposition, enabling permutation-invariant representations. It achieves state-of-the-art or competitive ROC-AUC performance across the Matbench metallicity, LA, DIM, and SG benchmarks and multiple MoleculeNet toxicity tasks (e.g., Tox21 mean AUC $\approx 0.96$), with clear improvements in the V1 variant that uses sub-elements. The results demonstrate robust, cross-domain applicability to both materials and organic chemistry, and the work highlights the potential for pre-training and expanded sub-element vocabularies to further enhance predictive accuracy and interpretability.
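To make the input representation concrete, the following is a minimal sketch of an atomic pair distribution function for a single atom in a finite cluster. Only the cutoff of 10 Å comes from the summary above; the function name `atomic_pdf`, the bin count, the shell-volume normalization, and the absence of periodic images are illustrative assumptions, not the authors' implementation.

```python
import math

def atomic_pdf(positions, center, r_cut=10.0, n_bins=100):
    """Histogram of neighbor distances around one atom (hypothetical helper).

    positions: list of (x, y, z) tuples in angstroms
    center:    index of the atom whose environment is described
    r_cut:     cutoff radius in angstroms (10 A, as stated in the summary)
    n_bins:    number of radial bins (assumed value)
    """
    dr = r_cut / n_bins
    counts = [0] * n_bins
    cx, cy, cz = positions[center]

    for j, (x, y, z) in enumerate(positions):
        if j == center:
            continue
        d = math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
        if d < r_cut:
            # Clamp to guard against floating-point rounding at the last bin edge.
            counts[min(int(d / dr), n_bins - 1)] += 1

    # Divide each count by the volume of its spherical shell (assumed normalization).
    pdf = []
    for k, c in enumerate(counts):
        r_lo, r_hi = k * dr, (k + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        pdf.append(c / shell_volume)
    return pdf

# Toy cluster: two neighbors inside the cutoff, one outside it.
positions = [(0.0, 0.0, 0.0), (1.05, 0.0, 0.0), (0.0, 2.05, 0.0), (0.0, 0.0, 12.0)]
pdf = atomic_pdf(positions, center=0)
```

Because the histogram depends only on distances to the central atom, it does not change when the neighbor list is reordered, which is consistent with the permutation invariance mentioned above. A crystal would additionally require periodic images of the unit cell, which this sketch omits.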

Abstract

We introduce the elEmBERT model for chemical classification tasks. It is based on deep learning techniques, such as a multilayer encoder architecture. We demonstrate the opportunities offered by our approach on sets of organic, inorganic, and crystalline compounds. In particular, we developed and tested the model using the Matbench and MoleculeNet benchmarks, which include crystal properties and drug design-related benchmarks. We also analyze the vector representations of chemical compounds, shedding light on the underlying patterns in structural data. Our model exhibits exceptional predictive capabilities and proves universally applicable to molecular and material datasets. For instance, on the Tox21 dataset, we achieved an average precision of 96%, surpassing the previously best result by 10%.

Paper Structure (21 sections, 16 figures, 6 tables)

Introduction
Methods
Model architecture
Datasets
Training procedure
Results
MP metallicity
LA, DIM and SG datasets
Tox21 dataset
Discussion
Conclusions
Acknowledgments
Availability
Sub-element approach
Organic benchmarks
...and 6 more sections

Figures (16)

Figure 1: elEmBERT model architecture. The initial step involves computing the pair distribution function for each element based on atom positions within the chemical compound. This information is then passed through the classifier model. Subsequently, the resulting sub-elements are converted into tokens, with additional tokens incorporated before input into the BERT module. The [CLS] token output vector from BERT is used for the classification task.
Figure 2: Sub-element classification: t-SNE Plots for Li (a) and Mg (b) atoms extracted from atomic PDFs of COD database.
Figure 3: Examples illustrating the division of elements into sub-elements based on their environment: a hypothetical organic compound (a) and Li$_8$CoO$_6$ (b) crystal with ID mp-27920. The numbers at the top right of elements correspond to sub-element indexes.
Figure 4: MP metallicity: Confusion matrix (a) and visualization of [CLS] token embeddings for the MP metallicity dataset for the reference (b) and predicted (c) datasets: blue circles denote negative labels (not metal) and orange dots represent positive labels (metal).
Figure 5: Classification task of inorganic compounds. Top row: visualization of [CLS] token embeddings for the LA dataset: a) reference labels and b) predicted labels. The embeddings are represented using blue circles for liquid phase labels and orange dots for amorphous labels. Bottom row: confusion matrix analysis of the LA (c), DIM (d), and SG (e) datasets.
...and 11 more figures
