Y-Mol: A Multiscale Biomedical Knowledge-Guided Large Language Model for Drug Development

Tengfei Ma, Xuan Lin, Tianle Li, Chaoyi Li, Long Chen, Peng Zhou, Xibao Cai, Xinyu Yang, Daojian Zeng, Dongsheng Cao, Xiangxiang Zeng

TL;DR

Y-Mol is a multiscale biomedical knowledge-guided LLM for drug development that covers lead compound discovery, preclinical and clinical prediction, and drug-related interaction prediction. It significantly outperforms general-purpose LLMs in discovering lead compounds, predicting molecular properties, and identifying drug interaction events.

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance in general tasks across various fields. However, their effectiveness in specific domains such as drug development remains a challenge. To address this challenge, we introduce Y-Mol, which forms a well-established LLM paradigm for the drug development workflow. Y-Mol is a multiscale biomedical knowledge-guided LLM designed to accomplish tasks across lead compound discovery, preclinical prediction, and clinical prediction. By integrating millions of items of multiscale biomedical knowledge and using LLaMA2 as the base LLM, Y-Mol augments reasoning capability in the biomedical domain by learning from a corpus of publications, knowledge graphs, and expert-designed synthetic data. This capability is further enriched with three types of drug-oriented instructions: description-based prompts from processed publications, semantic-based prompts for extracting associations from knowledge graphs, and template-based prompts for understanding expert knowledge from biomedical tools. In addition, Y-Mol offers a set of LLM paradigms that can autonomously execute downstream tasks across the entire drug development process, including virtual screening, drug design, pharmacological property prediction, and drug-related interaction prediction. Our extensive evaluations on various biomedical sources demonstrate that Y-Mol significantly outperforms general-purpose LLMs in discovering lead compounds, predicting molecular properties, and identifying drug interaction events.

Paper Structure

This paper contains 36 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Y-Mol provides a large-scale corpus and instructions for drug development across 24 tasks.
  • Figure 2: The architecture of Y-Mol. Y-Mol builds the LLM paradigm for drug development, which comprises two processes: (a) the pretrain-then-finetune framework of Y-Mol first pretrains LLaMA2 in a self-supervised manner on biomedical publications, then finetunes it using the constructed instructions; (b) Y-Mol is evaluated on downstream tasks using the finetuned LLaMA2.
  • Figure 3: The construction process of the biomedical corpus and instructions: (A) Collecting a large-scale biomedical corpus from biomedical publications within the domain of drug discovery. (B) Constructing instructions from coherent facts to enhance the context of drug-related interactions. (C) Building instructions from expert synthetic data generated by existing small models to distill the knowledge spectrum of drugs into Y-Mol.
  • Figure 4: The process of supervised finetuning of Y-Mol based on designed instructions.
  • Figure 5: The data distribution of Y-Mol in pretraining and supervised fine-tuning stages across different tasks.
  • ...and 4 more figures
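
The instruction construction described in the abstract and in Figure 3 (B) and (C) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Triple` class, the function names, the prompt wording, and the toy values are hypothetical and are not taken from the paper, which does not give its templates here. The sketch only shows the general idea of turning a knowledge-graph fact into a semantic-based prompt and wrapping a small-model prediction in a template-based prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A knowledge-graph fact (head, relation, tail). Hypothetical helper type."""
    head: str
    relation: str
    tail: str


def semantic_prompt(triple: Triple) -> dict:
    """Turn one KG fact into an instruction/response pair.

    The question wording is an assumption, not the paper's template.
    """
    instruction = f"What is the relation between {triple.head} and {triple.tail}?"
    return {"instruction": instruction, "response": triple.relation}


def template_prompt(smiles: str, prop: str, value: float) -> dict:
    """Wrap a prediction from an external tool in a fixed template.

    `value` stands in for the output of some small expert model;
    the template wording is an assumption.
    """
    instruction = f"Predict the {prop} of the molecule with SMILES {smiles}."
    return {"instruction": instruction, "response": f"{value:.2f}"}


if __name__ == "__main__":
    # Toy inputs chosen only to exercise the functions; the numeric value
    # is a placeholder, not a measured or reported property.
    print(semantic_prompt(Triple("drug A", "interacts with", "drug B")))
    print(template_prompt("CCO", "logP", 1.0))
```

Pairs in this instruction/response shape could then be fed to a supervised finetuning loop; how Y-Mol actually formats and mixes its instructions is described in the paper itself, not here.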