A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, Ji-Rong Wen

TL;DR

This paper proposes a molecular multimodal foundation model pretrained via contrastive learning on molecular graphs and their semantically related textual data. The model enhances molecular property prediction and can generate meaningful molecular graphs from natural language descriptions.

Abstract

Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities, including both intuitive diagrams and professional texts, to assist their understanding. Inspired by this, we propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data (crawled from published Scientific Citation Index papers) via contrastive learning. This AI model represents a critical attempt to directly bridge molecular graphs and natural language. Importantly, by capturing the specific and complementary information of the two modalities, our proposed model can better grasp molecular expertise. Experimental results show that our model not only exhibits promising performance in cross-modal tasks such as cross-modal retrieval and molecule caption, but also enhances molecular property prediction and is able to generate meaningful molecular graphs from natural language descriptions. We believe that our model will have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine, among others.
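
The abstract states only that the graph and text encoders are trained "via contrastive learning"; it does not spell out the loss here. As a rough illustration, the sketch below assumes a symmetric InfoNCE-style objective over a batch of paired embeddings, which is a common choice for two-encoder models. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(graph_emb, text_emb, temperature=0.1):
    """Assumed InfoNCE-style loss: pair i of graph_emb matches pair i of text_emb."""
    g = [normalize(v) for v in graph_emb]
    t = [normalize(v) for v in text_emb]
    # Cosine similarity between every graph and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t]
              for gi in g]
    transposed = [list(col) for col in zip(*logits)]
    # Average the graph-to-text and text-to-graph directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
graph_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
text_shuffled = text_aligned[1:] + text_aligned[:1]  # break every pairing

aligned_loss = symmetric_contrastive_loss(graph_emb, text_aligned)
shuffled_loss = symmetric_contrastive_loss(graph_emb, text_shuffled)
print(aligned_loss < shuffled_loss)
```

The loss is low when each graph embedding is closest to its own text embedding and high when the pairing is broken, which is the behavior a contrastive pretraining objective of this kind is meant to reward.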

Paper Structure (13 sections, 5 equations, 6 figures, 1 algorithm)

Figures (6)

  • Figure 1: Conceptual comparison of our MoMu model with the human brain and existing single-modal AI models. (a) The human brain can comprehensively understand molecular knowledge by learning from multiple modalities. (b) Existing AI models generally employ a single network to gain a single cognitive ability from a single modality of molecules. These models mainly fall into two categories: (top) language-based models take natural language texts and/or SMILES strings as input and can only be applied to text-related tasks; (bottom) graph-based models take molecular graphs as input and can only be adapted to graph-related tasks. (c) Our MoMu model learns from weakly-correlated paired text-graph data to associate the molecular graph modality with the natural language modality. It consists of two encoders, one for each modality, which are jointly trained via contrastive learning. Due to the strong generalization ability of the learned representations, MoMu can be adapted to various downstream tasks such as cross-modality retrieval, molecule caption, property prediction, and text-to-graph molecule generation, and thus effectively facilitates molecule-related scientific exploration.
  • Figure 2: Graph-to-text retrieval results. (a) Graph-to-text (G-T) retrieval performance on the PCdes dataset; the sentence-level retrieval results of the compared methods are those reported in zeng2022deep. (b) Results retrieved by KV-PLM* and our MoMu-K when an example/SMILES from the PCdes test set is used as the query. (c) Zero-shot graph-to-text (G-T) retrieval performance on our collected test set.
  • Figure 3: Text-to-graph retrieval results. (a) Text-to-graph (T-G) retrieval performance on the PCdes dataset; the sentence-level retrieval results of the compared methods are those reported in zeng2022deep. (b) Results retrieved by KV-PLM* and our MoMu-K when a text paragraph from the PCdes test set is used as the query. (c) A case study in which a query text is used to retrieve molecules that can be used to make dyes. Four of the top-5 molecules retrieved by MoMu-K are confirmed to be effective. (d) Zero-shot text-to-graph (T-G) retrieval performance on our collected test set.
  • Figure 4: Molecule caption results. (a) Comparison of MolT5 and our MoMu-enhanced MolT5 on the ChEBI-20 dataset. MolT5 represents the performance with only the MolT5 model, while MoMu+MolT5 represents the performance after adding GIN-extracted graph features to the input of the MolT5 encoder. (b) Example captions generated by different models.
  • Figure 5: Text-to-graph molecule generation results. (a) Molecules imagined from high-level vague descriptions. (b) Molecules imagined from functional descriptions. (c) Molecules imagined from structural descriptions.
  • ...and 1 more figure
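
The retrieval experiments in Figures 2 and 3 rank candidates from one modality against a query from the other. The sketch below shows one way such a ranking could be computed once both encoders have produced embeddings. It assumes cosine similarity as the score and recall@k as the metric; the captions do not name either, so both are assumptions, as are the function names and the toy vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query, candidates):
    """Return candidate indices sorted by decreasing similarity to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries)
               if i in rank_candidates(q, candidates)[:k])
    return hits / len(queries)

# Toy stand-ins for graph and text embeddings (hypothetical values).
graph_emb = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_emb = [[0.9, 0.2], [0.1, 0.8], [-1.0, 0.1]]

# Graph-to-text retrieval: each graph queries the pool of texts.
print(rank_candidates(graph_emb[0], text_emb))
print(recall_at_k(graph_emb, text_emb, k=1))
```

Swapping the two arguments of `recall_at_k` gives the text-to-graph direction, since the same similarity matrix is simply read row-wise or column-wise.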