NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products

Yuheng Ding, Bo Qiang, Yiran Zhou, Jie Yu, Qi Li, Liangren Zhang, Yusong Wang, Zhenmin Liu

TL;DR

NaFM addresses the limited generalizability of task-specific natural product models by pre-training a foundation model tailored to natural product chemistry. It introduces scaffold-aware contrastive learning and scaffold-subgraph reconstruction to prioritize biosynthetic scaffolds while preserving side-chain information, trained on a large unlabeled natural product corpus. Across taxonomy classification, biosynthetic context, and bioactivity prediction/virtual screening, NaFM achieves state-of-the-art performance and demonstrates robust generalization, including improved activity prediction and the ability to identify novel actives. The work suggests a practical path for integrating computational prediction with biosynthesis and drug discovery workflows, supported by open data and code to foster reproducibility and further research.

Abstract

Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural product research rely primarily on supervised learning approaches designed for specific downstream tasks. However, this one-model-per-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well suited to the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pre-training strategy tailored specifically to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare NaFM with baselines designed for synthesized molecules on taxonomy classification, demonstrating that current models are inadequate for understanding natural synthesis. Furthermore, through a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Finally, we apply our method to virtual screening, showing that informative natural product representations can lead to more effective identification of potential drug candidates.

Paper Structure

This paper contains 23 sections, 7 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Overview of NaFM pre-training. After the natural product molecule undergoes scaffold-aware masking, both the masked and unmasked information are simultaneously input into a multi-layer Graph Isomorphism Network. The masked information is then processed through a projection head to predict the masked atom attributes, bond attributes, and topological information. Meanwhile, the masked and unmasked information pass through a pooling function and are used for scaffold-aware contrastive learning. (b) Details of scaffold-subgraph reconstruction. First, a subgraph consisting of multiple atoms and chemical bonds is randomly selected from the scaffold. Within the subgraph, both node and edge attributes are masked, and all nodes are fully connected (i.e., the subgraph is expanded into a fully connected graph), thereby masking the topological information. During reconstruction, the model must predict both node and edge attributes while also distinguishing real edges from the artificially added virtual edges. (c) Details of scaffold-aware contrastive learning. Different colors represent varying weights defined by scaffold similarity. (d) The "central dogma" of natural products. The biological source, biosynthetic gene clusters, biosynthetic pathways, and bioactivity of natural products are interconnected through the scaffold, which acts as a bridge linking these key aspects.
  • Figure 2: The bar plots compare the performance of NaFM and NPClassifier, in terms of AUPRC, on various superclass categories (a) and biosynthetic pathways (b), including Carbohydrates, Amino Acids and Peptides, Alkaloids, Terpenoids, Shikimates and Phenylpropanoids, Polyketides, and Fatty Acids. The gray segments indicate the shared performance (minimum AUPRC), and the light blue and orange segments highlight the additional AUPRC achieved by NaFM and NPClassifier, respectively, beyond the other model. Higher values indicate better model performance.
  • Figure 3: Natural Products Atlas. The obtained representations are projected into two dimensions using TMAP (Probst and Reymond, 2020) and visualized with Faerun (Probst and Reymond, 2018). For each biological source, a representative example is highlighted in a designated box, showcasing a characteristic scaffold that typifies compounds derived from that particular source.
  • Figure 4: The figure visualizes the AUROC gap between NaFM and various molecular representations. The results show the AUROC for the 128 most frequent protein families found in bacteria and fungi. The color in each block represents the magnitude of the AUROC gap, ranging from negative to positive, with colors transitioning from red to blue. Deeper colors indicate a greater absolute difference.
  • Figure 5: (a) Docking scores of the top 0.1% of molecules ranked by the model as acetylcholinesterase inhibitors (approximately 600 molecules), along with 600 randomly selected molecules. The histogram and distribution of docking scores with acetylcholinesterase are shown in the figure. Orange represents the model-selected group, while blue represents the random group. The x-axis represents the docking score, and the y-axis represents frequency. A lower docking score (a more left-shifted distribution) indicates higher inhibitory activity. Docking was performed using Schrödinger Maestro, with detailed settings provided in Supplementary Section 1.3. (b) A molecule with a high model score was selected for docking visualization. This molecule forms cation-$\pi$ interactions with the residues PHE-388, TYR-337, and TRP-86 in the acetylcholinesterase binding pocket (yellow dashed lines), and a hydrogen bond with the residue SER-125 (red dashed line). (c) The first and second columns display several molecules with high model scores and good docking activity, along with their most similar positive counterparts encountered during model fine-tuning. The third column shows the maximum common substructure of each pair. (d) The enrichment factor results of NaFM and Glide-based molecular docking for activity screening across seven targets. The values are obtained through five-fold cross-validation, with the mean and variance reported. A higher enrichment factor indicates a stronger screening capability of the method.
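
The masking step described in the caption of Figure 1(b) can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_scaffold_subgraph`, the `MASK` token, the dictionary-based graph encoding, and the random growth rule for picking a connected subgraph are all assumptions made for illustration, and the scaffold atom indices are taken as given rather than computed from the molecule.

```python
import itertools
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden atom or bond attribute


def mask_scaffold_subgraph(atoms, bonds, scaffold, size, rng):
    """Mask a random connected subgraph of the scaffold.

    atoms:    list of atom labels, indexed by atom id
    bonds:    dict {(i, j): bond_label} with i < j
    scaffold: set of atom ids that belong to the scaffold (assumed given)
    size:     maximum number of atoms in the masked subgraph
    """
    # Adjacency restricted to scaffold atoms, so the subgraph stays in the scaffold.
    neighbors = {i: set() for i in scaffold}
    for i, j in bonds:
        if i in scaffold and j in scaffold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Grow a connected subgraph from a random seed atom (assumed sampling rule).
    seed = rng.choice(sorted(scaffold))
    sub = [seed]
    frontier = sorted(neighbors[seed])
    while frontier and len(sub) < size:
        nxt = frontier.pop(rng.randrange(len(frontier)))
        sub.append(nxt)
        for n in sorted(neighbors[nxt]):
            if n not in sub and n not in frontier:
                frontier.append(n)
    sub_set = set(sub)

    # Hide node attributes inside the subgraph.
    masked_atoms = [MASK if i in sub_set else a for i, a in enumerate(atoms)]

    # Keep bonds that are not fully inside the subgraph unchanged.
    masked_bonds = {
        e: b for e, b in bonds.items()
        if not (e[0] in sub_set and e[1] in sub_set)
    }

    # Fully connect the subgraph with masked edges; record which ones are real.
    edge_is_real = {}
    for i, j in itertools.combinations(sorted(sub_set), 2):
        masked_bonds[(i, j)] = MASK
        edge_is_real[(i, j)] = (i, j) in bonds

    # Reconstruction targets: atom labels, real bond labels, real-vs-virtual flags.
    targets = {
        "atoms": {i: atoms[i] for i in sub_set},
        "bonds": {e: bonds[e] for e, real in edge_is_real.items() if real},
        "edge_is_real": edge_is_real,
    }
    return masked_atoms, masked_bonds, targets


# Toy example: a six-membered ring (atoms 0-5) with one substituent (atom 6).
atoms = ["C"] * 7
bonds = {(0, 1): "aromatic", (1, 2): "aromatic", (2, 3): "aromatic",
         (3, 4): "aromatic", (4, 5): "aromatic", (0, 5): "aromatic",
         (0, 6): "single"}
scaffold = {0, 1, 2, 3, 4, 5}

masked_atoms, masked_bonds, targets = mask_scaffold_subgraph(
    atoms, bonds, scaffold, size=3, rng=random.Random(0)
)
```

In this toy case, three connected ring atoms are masked and joined by three masked edges, of which two are real bonds and one is virtual. A model trained on such inputs would have to recover the atom labels, the bond labels, and the real-versus-virtual flag for each masked edge, which is how the caption describes topology being hidden and then reconstructed.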