MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction

Jun-Hyung Park, Yeachan Kim, Mingyu Lee, Hyuntae Park, SangKeun Lee

TL;DR

MolTRES tackles the overfitting and early convergence observed in SMILES-based pre-training for molecular property prediction with two components: DynaMol, a generator-discriminator training scheme that creates harder, more informative training examples, and mat2vec-based knowledge transfer that injects literature-derived knowledge into the embeddings. The approach pre-trains on billions of SMILES across PubChem and ZINC, and demonstrates state-of-the-art performance across most MoleculeNet classification and regression tasks, with ablations confirming the complementary contributions of both components. The findings point to improved scalability and generalization for chemical language representation learning, offering a path toward more accurate property prediction and potential multi-modal extensions. Overall, MolTRES shows that increasing training difficulty and leveraging external knowledge improve SMILES Transformer representations for chemistry.

Abstract

Chemical representation learning has gained increasing interest due to the limited availability of supervised data in fields such as drug and materials design. This interest particularly extends to chemical language representation learning, which involves pre-training Transformers on SMILES sequences -- textual descriptors of molecules. Despite its success in molecular property prediction, current practices often lead to overfitting and limited scalability due to early convergence. In this paper, we introduce a novel chemical language representation learning framework, called MolTRES, to address these issues. MolTRES incorporates generator-discriminator training, allowing the model to learn from more challenging examples that require structural understanding. In addition, we enrich molecular representations by integrating external materials embeddings, which transfer knowledge from scientific literature. Experimental results show that our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
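
The abstract does not spell out how generator-discriminator training produces its "more challenging examples". One common reading is an ELECTRA-style replaced-token-detection setup: some SMILES tokens are masked, a generator proposes replacements, and the discriminator labels each token as original or replaced. The sketch below illustrates only that data-construction step under this assumption. The toy vocabulary, the per-character tokenization, the 0.5 masking ratio, and the names `corrupt` and `uniform_generator` are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy vocabulary of SMILES tokens (not the paper's tokenizer).
VOCAB = ["C", "N", "O", "c", "n", "(", ")", "=", "1"]
MASK = "<mask>"


def corrupt(tokens, mask_ratio, sample_token, rng):
    """Mask a fraction of positions and fill them with generator samples.

    Returns (masked, corrupted, labels). labels[i] is 1 if the corrupted
    token differs from the original token at position i, and 0 otherwise.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # Input the generator would see: selected positions are hidden.
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    # Input the discriminator would see: hidden positions are filled in.
    corrupted = [
        sample_token(masked, i, rng) if i in positions else t
        for i, t in enumerate(tokens)
    ]
    # A sampled token that happens to equal the original counts as original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return masked, corrupted, labels


def uniform_generator(masked, position, rng):
    """Stand-in for a trained generator: samples uniformly from VOCAB."""
    return rng.choice(VOCAB)


rng = random.Random(0)
tokens = list("c1ccccc1O")  # phenol, split per character for simplicity
masked, corrupted, labels = corrupt(tokens, 0.5, uniform_generator, rng)
print("".join(corrupted), labels)
```

In an actual model the uniform sampler would be replaced by a small masked language model, whose plausible-but-wrong replacements are what make the discriminator's task harder than plain masked-token prediction.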

Paper Structure (29 sections, 8 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Existing pre-training methods for chemical language representation learning converge early in training, before seeing the entire dataset. Consequently, MoLFormer (Ross et al., 2022), a state-of-the-art chemical language representation learning method, exhibits limited scalability in terms of data size.
  • Figure 2: Overview of MolTRES. $E_G$ and $E_D$ represent the embedding layers of the generator and discriminator, respectively. Note that the mat2vec embeddings are frozen during pre-training.
  • Figure 3: Training curves of MolTRES with mat2vec embeddings (the solid line) and without mat2vec embeddings (the dashed line). The left shows the pre-training loss curves, while the right shows the average ROC-AUC scores.
  • Figure 4: Comparison of MolTRES for different masking ratios on MoleculeNet classification tasks.
  • Figure 5: Comparison of MolTRES for different $\lambda$ on MoleculeNet classification tasks.