Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties
Shokirbek Shermukhamedov, Dilorom Mamurjonova, Michael Probst
TL;DR
This work addresses the structure-to-property challenge, that is, predicting chemical properties from structural information. It introduces elEmBERT, a transformer-based encoder that combines local attention on atomic pair distribution functions (PDFs) with a global aggregation, using element embeddings and sub-element tokens. The model operates on inputs derived from atomic PDFs with a cutoff $r_\text{cut}=10\,\text{Å}$ and expands the element vocabulary from $V_\text{size}=101$ to $V_\text{size}=565$ via sub-element decomposition, enabling permutation-invariant representations. It achieves state-of-the-art or competitive ROC-AUC performance on the Matbench metallicity, LA, DIM, and SG benchmarks and on multiple MoleculeNet toxicity tasks (e.g., Tox21 mean AUC $\approx 0.96$), with clear improvements in the V1 variant, which uses sub-elements. The results demonstrate robust, cross-domain applicability to both materials and organic chemistry, and the work highlights the potential of pre-training and expanded sub-element vocabularies to further improve predictive accuracy and interpretability.
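To make the input representation more concrete, the following is a minimal sketch of a per-atom pair distribution histogram within a cutoff radius. It is an illustration under stated assumptions, not the authors' implementation: the function name `atomic_pdf`, the bin count, the absence of periodic images, and the lack of any normalization or smoothing are all choices made here for brevity. Only the cutoff of 10 Å comes from the text above.

```python
import numpy as np


def atomic_pdf(positions, center, r_cut=10.0, n_bins=50):
    """Histogram of distances from one atom to its neighbours within r_cut.

    Hypothetical helper for illustration only: non-periodic, unnormalized,
    and unsmoothed. Positions and r_cut are in angstroms.
    """
    positions = np.asarray(positions, dtype=float)
    # Distances from the chosen atom to every atom in the structure.
    dists = np.linalg.norm(positions - positions[center], axis=1)
    # Drop the atom's zero distance to itself.
    dists = np.delete(dists, center)
    # Keep only neighbours inside the cutoff sphere.
    dists = dists[dists < r_cut]
    hist, edges = np.histogram(dists, bins=n_bins, range=(0.0, r_cut))
    return hist, edges


# Toy structure: the last atom lies outside the 10 angstrom cutoff of atom 0.
toy_positions = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [0.0, 2.5, 0.0],
    [0.0, 0.0, 12.0],
])
toy_hist, toy_edges = atomic_pdf(toy_positions, center=0, r_cut=10.0, n_bins=10)

# Reordering the neighbours leaves the histogram unchanged.
shuffled_hist, _ = atomic_pdf(toy_positions[[0, 2, 1, 3]], center=0, r_cut=10.0, n_bins=10)
```

Because the histogram depends only on the set of neighbour distances, it does not change when the neighbours are listed in a different order, which is the kind of permutation invariance the summary refers to.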
Abstract
We introduce the elEmBERT model for chemical classification tasks. It is a deep learning model built on a multilayer encoder architecture. We demonstrate the opportunities offered by our approach on sets of organic, inorganic, and crystalline compounds. In particular, we developed and tested the model using the Matbench and MoleculeNet benchmarks, which cover crystal-property and drug-design-related tasks. We also analyze the vector representations of chemical compounds, shedding light on the underlying patterns in structural data. Our model exhibits exceptional predictive capabilities and proves applicable to both molecular and material datasets. For instance, on the Tox21 dataset, we achieved an average precision of 96%, surpassing the previous best result by 10%.
