Pre-trained Molecular Language Models with Random Functional Group Masking

Tianhao Peng, Yuchen Li, Xuhong Li, Jiang Bian, Zeke Xie, Ning Sui, Shahid Mumtaz, Yanwu Xu, Linghe Kong, Haoyi Xiong

TL;DR

Results indicate that MLM-FG effectively learns to interpret molecular properties from SMILES, offering a powerful new tool for computational chemistry and related disciplines.

Abstract

Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained on vast numbers of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, and 2D or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose MLM-FG, a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups, in order to incorporate structural information about atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of MLM-FG. Our findings reveal that MLM-FG outperforms existing pre-training models, whether based on SMILES or graphs, in 9 out of the 11 downstream tasks, and ranks as a close second in the remaining ones.

Paper Structure

This paper contains 18 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustration of the proposed MLM-FG framework: (1) MLM-FG adopts 12-layer multi-head transformer blocks (in either RoBERTa or MoLFormer architectures) with a hidden state dimension of $D_h$=768 for pre-training and fine-tuning, (2) MLM-FG follows a functional group-aware random masking strategy to pre-train the model on a large corpus of 10 to 100 million SMILES sequences from PubChem, and (3) MLM-FG fine-tunes the pre-trained models to support a wide range of molecular machine learning applications.
  • Figure 2: Visualization of molecular representations learned by MLM-FG via UMAP. Representations are extracted, without fine-tuning, for the downstream datasets, which contain 312,879 unique molecules. Each point is coloured by its corresponding molecular weight (g/mol).
  • Figure 3: Visualization of the learned attention map and corresponding molecular structure (bond connectivity and 3D distance in Angstrom) for SMILES "CP(Br)C=O".
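
The functional group-aware masking idea described in the abstract and in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer regex is a commonly used SMILES pattern, the `FUNCTIONAL_GROUPS` table is a tiny hypothetical list chosen for illustration, and matching groups by literal token subsequences is a simplifying assumption (a real pipeline would more likely use substructure matching from a cheminformatics toolkit). The behaviour when no group is found (returning the tokens unmasked) is also an assumption.

```python
import random
import re

# Commonly used regex for splitting a SMILES string into tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

# Hypothetical, deliberately tiny table of functional groups written as
# SMILES fragments; the paper's actual group list is not reproduced here.
FUNCTIONAL_GROUPS = {
    "carboxylic_acid": "C(=O)O",
    "carbonyl": "C=O",
    "nitrile": "C#N",
    "bromide": "Br",
    "chloride": "Cl",
}


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize SMILES: {smiles!r}")
    return tokens


def find_group_spans(tokens):
    """Return (start, end) token spans that literally match a known group."""
    spans = []
    for fragment in FUNCTIONAL_GROUPS.values():
        pattern = tokenize(fragment)
        for i in range(len(tokens) - len(pattern) + 1):
            if tokens[i:i + len(pattern)] == pattern:
                spans.append((i, i + len(pattern)))
    return spans


def mask_functional_group(smiles, rng, mask_token="<mask>"):
    """Mask one randomly chosen functional-group span.

    Returns (masked_tokens, target_tokens). If no group is found, the
    tokens are returned unmasked with an empty target (an assumption).
    """
    tokens = tokenize(smiles)
    spans = find_group_spans(tokens)
    if not spans:
        return tokens, []
    start, end = rng.choice(spans)
    masked = tokens[:start] + [mask_token] * (end - start) + tokens[end:]
    return masked, tokens[start:end]


if __name__ == "__main__":
    # The molecule from Figure 3; either the bromide or the carbonyl is masked.
    masked, target = mask_functional_group("CP(Br)C=O", random.Random(0))
    print("".join(masked), "->", "".join(target))
```

During pre-training, the masked token sequence would be the model input and the hidden tokens the prediction target, so the model has to reconstruct a chemically meaningful unit rather than an arbitrary character.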