Thinking like a CHEMIST: Combined Heterogeneous Embedding Model Integrating Structure and Tokens

Nikolai Rekut, Alexey Orlov, Klea Ziu, Elizaveta Starykh, Martin Takac, Aleksandr Beznosikov

TL;DR

This work addresses the challenge of building chemically meaningful molecular representations. Molecules are decomposed into BRICS-based substructures, and descriptor sequences computed for these fragments are fed to a RoBERTa language model. In parallel, graph-based backbones (GIN, GCN, Graphormer) encode the same molecules within a bimodal, contrastive framework. The model aligns language-derived and graph-derived embeddings in a shared latent space using projection blocks and NT-Xent losses. The total loss is $L = \alpha L_{lang} + \beta L_{graph} + \gamma L_{bimodal}$, and a temperature parameter $\tau$ scales the contrastive terms. Empirical results on QSAR benchmarks show improved performance over SMILES-based baselines and some graph-only baselines, with Graphormer offering advantages on large datasets and the simpler architectures (GIN, GCN) performing well when data is limited. The approach provides interpretable, fragment-level chemical representations and shows practical potential for scalable, multimodal molecular modeling. The authors acknowledge limitations on very small datasets and on chemically distinct (inorganic or polymeric) domains, and they advocate scaling pretraining up to tens of millions of molecules.
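To make the loss structure above concrete, the following is a minimal pure-Python sketch of a cross-modal NT-Xent term and the weighted total loss. It is an illustration under stated assumptions, not the authors' implementation. The function names (`cosine`, `nt_xent`, `bimodal_loss`, `total_loss`), the default temperature, the default weights, and the choice to average the two alignment directions are all hypothetical. Each language embedding is treated as the positive for the graph embedding of the same molecule, and the other molecules in the batch act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(anchors, candidates, tau=0.1):
    """Mean temperature-scaled cross-entropy over a batch.

    candidates[i] is the positive for anchors[i]; every other
    candidate in the batch serves as a negative (an assumption of
    this sketch).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def bimodal_loss(lang_emb, graph_emb, tau=0.1):
    """Symmetric alignment term: average of both directions (assumed)."""
    return 0.5 * (nt_xent(lang_emb, graph_emb, tau) + nt_xent(graph_emb, lang_emb, tau))


def total_loss(l_lang, l_graph, l_bimodal, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum L = alpha * L_lang + beta * L_graph + gamma * L_bimodal."""
    return alpha * l_lang + beta * l_graph + gamma * l_bimodal


# Toy batch of two molecules with 2-dimensional projected embeddings.
lang = [[1.0, 0.0], [0.0, 1.0]]
graph_aligned = [[1.0, 0.0], [0.0, 1.0]]
graph_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned = bimodal_loss(lang, graph_aligned)
swapped = bimodal_loss(lang, graph_swapped)
```

In this toy batch, the aligned pairing yields a much smaller loss than the swapped pairing, which is the behavior a contrastive alignment objective is meant to reward. Lowering `tau` sharpens the softmax and penalizes near-miss negatives more strongly.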

Abstract

Representing molecular structures effectively in chemistry remains a challenging task. Language models and graph-based models are widely used in this domain and consistently achieve state-of-the-art results across a range of tasks. However, the prevailing practice of representing chemical compounds in the SMILES format, which is used by most datasets and many language models, has notable limitations as a training data format. In this study, we present a novel approach that decomposes molecules into substructures and computes descriptor-based representations for these fragments, providing more detailed and chemically relevant input for model training. We use this substructure and descriptor data as input to a language model and also propose a bimodal architecture that integrates this language model with graph-based models. We use RoBERTa as the language model and Graph Isomorphism Networks (GIN), Graph Convolutional Networks (GCN), and Graphormer as the graph models. Our framework shows notable improvements over traditional methods in various tasks such as Quantitative Structure-Activity Relationship (QSAR) prediction.

Paper Structure

This paper contains 47 sections, 11 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: An example of splitting a molecule into substructures and creating the additional substructure (for serotonin, this is the parent molecule, because it was split into only two BRICS blocks).
  • Figure 2: Example of the tokenization process. Tokens "0" and "2" correspond to BOS (beginning of sequence) and EOS (end of sequence), respectively. The '!' and '$' symbols are left untokenized for clarity.
  • Figure 3: Full architecture of the bimodal model. The language and graph blocks are outlined in blue and orange, respectively. Projection blocks are marked in red.
  • Figure 4: Tops masking process and computing the graph loss for one batch.
  • Figure 5: The structure of the projection block. It maps the output vectors of the models into the same linear space.
  • ...and 1 more figure