MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction
Jun-Hyung Park, Yeachan Kim, Mingyu Lee, Hyuntae Park, SangKeun Lee
TL;DR
MolTRES tackles the data scarcity and overfitting observed in SMILES-based pre-training for molecular property prediction with two components: DynaMol, a generator-discriminator training scheme that creates harder, more informative training examples, and mat2vec-based knowledge transfer, which injects literature-derived molecular property information into the embeddings. The approach pre-trains on billions of SMILES strings from PubChem and ZINC and achieves state-of-the-art results on most MoleculeNet classification and regression tasks, with ablations confirming that the two components make complementary contributions. The findings indicate improved scalability and generalization for chemical language representation learning, and point toward more accurate property prediction and possible multi-modal extensions. Overall, MolTRES shows that increasing training difficulty and leveraging external knowledge improve SMILES Transformer representations for chemistry.
Abstract
Chemical representation learning has gained increasing interest due to the limited availability of supervised data in fields such as drug and materials design. This interest particularly extends to chemical language representation learning, which involves pre-training Transformers on SMILES sequences, the textual descriptors of molecules. Despite its success in molecular property prediction, current practices often lead to overfitting and limited scalability due to early convergence. In this paper, we introduce a novel chemical language representation learning framework, called MolTRES, to address these issues. MolTRES incorporates generator-discriminator training, allowing the model to learn from more challenging examples that require structural understanding. In addition, we enrich molecular representations with knowledge from scientific literature by integrating external materials embeddings. Experimental results show that our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
