NPGPT: Natural Product-Like Compound Generation with GPT-based Chemical Language Models

Koh Sakano, Kairi Furui, Masahito Ohue

TL;DR

This study trained chemical language models on a natural product dataset, generated natural product-like compounds, and evaluated both the performance of the generated set as a drug candidate library and the effectiveness of the individual compounds as drug candidates.

Abstract

Natural products are substances produced by organisms in nature and often possess biological activity and structural diversity. Drug development based on natural products has been common for many years. However, the intricate structures of these compounds make structure determination and synthesis challenging, particularly when compared with the efficiency of high-throughput screening of synthetic compounds. In recent years, deep learning-based methods have been applied to molecule generation. In this study, we trained chemical language models on a natural product dataset and generated natural product-like compounds. The results showed that the distribution of the generated compounds was similar to that of natural products. We also evaluated the effectiveness of the generated compounds as drug candidates. Our method can be used to explore the vast chemical space and reduce the time and cost of natural product drug discovery.

Paper Structure

This paper contains 12 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Examples of SMILES and SELFIES encoding (cochliodinol, a natural product compound)
  • Figure 2: t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of smiles-gpt, along with molecules from COCONUT.
  • Figure 3: t-SNE visualization of 2,000 molecules generated by the original and fine-tuned models of ChemGPT, along with molecules from COCONUT.
  • Figure 4: Kernel density estimation of NP Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al.) and natural products from COCONUT.
  • Figure 5: Kernel density estimation of SA Scores for molecules generated by the original and fine-tuned models of smiles-gpt and ChemGPT, compared with molecules generated in the previous research (Tay et al.) and natural products from COCONUT.
  • ...and 4 more figures