MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model

Yifan Wu, Min Zeng, Yang Li, Yang Zhang, Min Li

TL;DR

This study proposes MolMetaLM, a novel physicochemical knowledge-guided molecular meta language framework. Its input is formatted as multiple <S,P,O> (subject, predicate, object) knowledge triples sharing the same S (i.e., the molecule), which enhances learning of the semantic relationships between physicochemical knowledge and molecules.

Abstract

Most current molecular language models transfer the masked language model or image-text generation model from natural language processing to the molecular field. However, molecules are not solely characterized by atom/bond symbols; they also encapsulate important physical/chemical properties. Moreover, ordinary language models bring grammar rules that are irrelevant for understanding molecules. In this study, we propose MolMetaLM, a novel physicochemical knowledge-guided molecular meta language framework. We design a molecule-specialized meta language paradigm, formatted as multiple <S,P,O> (subject, predicate, object) knowledge triples sharing the same S (i.e., the molecule), to enhance learning of the semantic relationships between physicochemical knowledge and molecules. By introducing different molecular knowledge and noises, the meta language paradigm generates tens of thousands of pretraining tasks. By learning to recover sequences corrupted with token-, sequence-, and order-level noises, MolMetaLM exhibits proficiency in large-scale benchmark evaluations involving property prediction, molecule generation, conformation inference, and molecular optimization. Through MolMetaLM, we offer a new insight for designing language models.
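
To make the triple format concrete, here is a minimal sketch of how such an input could be assembled. It is not the authors' implementation: the marker tokens (`[S]`, `[P]`, `[O]`, `[MASK]`), the function names, the character-level SMILES tokenization, and the property values are all assumptions made for illustration. Order-level noise is interpreted here as shuffling the triples and token-level noise as masking object values; sequence-level noise is not shown.

```python
import random


def build_triples(smiles, properties):
    """Return (S, P, O) triples that all share the same subject S (the molecule)."""
    return [(smiles, name, str(value)) for name, value in properties.items()]


def serialize(triples):
    """Flatten k >= 1 triples into one token list; S is written once because it is shared."""
    subject = triples[0][0]
    # Character-level split is a toy assumption; it would mis-tokenize atoms such as "Cl".
    tokens = ["[S]", *subject]
    for _, predicate, obj in triples:
        tokens += ["[P]", predicate, "[O]", obj]
    return tokens


def add_noise(triples, mask_prob, rng):
    """Build a (source, target) pair: the source is corrupted, the target is clean."""
    shuffled = triples[:]
    rng.shuffle(shuffled)  # assumed order-level noise: permute the triples
    noisy = [
        (s, p, "[MASK]" if rng.random() < mask_prob else o)  # assumed token-level noise
        for s, p, o in shuffled
    ]
    return serialize(noisy), serialize(triples)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative property values for ethanol, not taken from the paper.
    triples = build_triples("CCO", {"MolWt": 46.07, "TPSA": 20.23})
    source, target = add_noise(triples, mask_prob=0.5, rng=rng)
    print("source:", " ".join(source))
    print("target:", " ".join(target))
```

Under these assumptions, each choice of which properties to include and which noise to apply yields a different denoising task, which is one way to read the abstract's claim that the paradigm generates many pretraining tasks.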

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures.

Figures (5)

  • Figure 1: General framework of MolMetaLM. (a) Construction of the input meta language. The input to MolMetaLM is a mixture of $k$ $<\mathrm{S},\mathrm{P},\mathrm{O}>$ triples that share the same $\mathrm{S}$. Token-level, sequence-level, and order-level noises are then added to the input to construct the source and target sequences. The noise design at different levels drives the model to learn to handle different denoising scenarios, so as to achieve the generation goals of different tasks. (b) The backbone of MolMetaLM. It is a stacked transformer decoder variant that applies RMSNorm, SwiGLU, and rotary position embedding. (c) The applications of MolMetaLM. With different designs of the input meta language, MolMetaLM demonstrates proficiency in molecular generation, molecular optimization, property prediction, structure prediction, and other tasks.
  • Figure 2: (a) The relation between the conditional property values and the property values of the molecules generated by different methods. For each sub-figure in the upper part, the x-axis indicates the given property value constraints and the y-axis indicates the property values of the generated molecules. For each method, points are marked only for valid generated molecules, and all marked points are fitted with a degree-one curve (a line). The slopes of the fitted lines are given in parentheses in the legends. It is worth noting that a slope closer to 1.0 does not always indicate better performance. For instance, in generation conditioned on TPSA (bottom part), MolT5 exhibits a slope of 0.982 yet struggles to generate molecules with the given TPSA value: in the TPSA 500 task it instead generates molecules with a value of 1003, which results in a slope close to 1.0. (b) Results of multiple-condition molecule generation. (c) Results of unconditional molecule generation. (d) Curves of similarity and pLogP improvement on Jin's test set. (e) Fingerprint/structure-based molecule generation cases using MolMetaLM. For fingerprint-based molecule generation, MolMetaLM generates molecules with high MACCS-based Tanimoto similarity; for structure-based molecule generation, it generates molecules that are not only similar but also have similarly high docking affinities. The docking affinities are obtained from AutoDock Vina.
  • Figure 3: (a) The fine-tuning framework of MolMetaLM for molecular property prediction. First, SMILES are input into the pre-trained MolMetaLM to obtain the sequential molecular representation. Then max-pooling and avg-pooling are applied to extract the molecule-level features. Finally, the molecule-level features are fed into an FFN to generate the final predicted property value. (b) Blind docking results on the CASF-2016 test set. The x-axis is the RMSD cutoff and the y-axis is the percentage of ligands with RMSD below the cutoff. These curves indicate how accurate the methods are at docking ligands into the correct binding pocket and predicting the correct binding poses. (c) Classification results on the MoleculeNet benchmark datasets. The x-axis denotes different methods and the y-axis is the macro AUROC. (d) RMSEs of molecular activity prediction for the ten GPCR targets. (e) Regression results on AGBT's benchmark datasets. The x-axis indicates the 7 molecular property regression datasets and the y-axis is the $R^2$ metric, defined as the squared Pearson correlation.
  • Figure 4: (a) Scatter plot of the numerical differences in the condition sequences versus the cosine distances of the constraint embedded vectors. The x-axis denotes the absolute values of the numerical differences in the condition sequences and the y-axis is the cosine distance between their embedded vectors. (b) Pearson correlation coefficients of similarities obtained by different molecular fingerprints or representations. The Pearson correlation coefficient (pearsonr) for each pair is shown in the title, and the coefficient of determination ($R^2$) of the fitted line is recorded in the bottom right corner of each sub-figure. (c) Linearly separable boundaries for the molecular representations of Uni-Mol and MolMetaLM on four binary classification datasets.
  • Figure 5: (a) Correlation between the performance of SMILES-based property prediction and property-based molecule generation. (b) Molecule embedding space and the variant of the space with the introduction of properties. (c) The analogical reasoning process of MolMetaLM. Generally speaking, during the pretraining process, MolMetaLM learns the embedding of all training samples and constructs the SMILES-PKC embedding space. During the inference process, MolMetaLM acquires the embedding $e_q$ of the query sequence and retrieves the relevant samples $(e_h,e_t)$ from the memorized training samples. Finally, the result is generated by integrating the retrieved samples and performing the analogical reasoning in the SMILES-PKC embedding space.
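
Several of the captions above rely on simple statistics: the slope of a degree-one fit (Figure 2a), $R^2$ defined as the squared Pearson correlation (Figure 3e), and the cosine distance between embedded vectors (Figure 4a). The sketch below computes each in plain Python. The function names are hypothetical and the numbers are made-up toy values, not results from the paper; the toy case only illustrates the caveat in Figure 2(a) that a slope near 1.0 can coexist with large errors.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def fit_slope(x, y):
    """Least-squares slope a of the degree-one fit y ~ a*x + b."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r2(x, y):
    """R^2 defined as the squared Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy data: every generated value misses its target by 500,
    # yet both the fitted slope and the squared Pearson correlation equal 1.
    targets = [100, 200, 300, 400, 500]
    generated = [600, 700, 800, 900, 1000]
    print("slope:", fit_slope(targets, generated))
    print("R^2:", pearson_r2(targets, generated))
    print("mean abs error:", mean([abs(g - t) for g, t in zip(generated, targets)]))
```

Because both the slope and the squared Pearson correlation are insensitive to a constant offset, they are best read alongside an absolute error measure when judging conditional generation.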