Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications

David Oniani, Jordan Hilsman, Chengxi Zang, Junmei Wang, Lianjin Cai, Jan Zawala, Yanshan Wang

TL;DR

This paper proposes a new task, the translation between drug molecules and their corresponding indications, and evaluates nine variations of the T5 LLM on this task using two public datasets obtained from ChEMBL and DrugBank.

Abstract

A drug molecule is a substance that changes an organism's mental or physical state. Every approved drug has an indication, which refers to the therapeutic use of that drug for treating a particular medical condition. While Large Language Models (LLMs), a generative Artificial Intelligence (AI) technique, have recently demonstrated effectiveness in translating between molecules and their textual descriptions, there remains a gap in research regarding their application to translation between drug molecules and indications, which could greatly benefit the drug discovery process. The capability of generating a drug from a given indication would allow for the discovery of drugs targeting specific diseases or targets and ultimately provide patients with better treatments. In this paper, we first propose a new task, the translation between drug molecules and corresponding indications, and then test existing LLMs on this new task. Specifically, we consider nine variations of the T5 LLM and evaluate them on two public datasets obtained from ChEMBL and DrugBank. Our experiments show early results of using LLMs for this task and provide a perspective on the state of the art. We also emphasize the current limitations and discuss future work that has the potential to improve performance on this task. The creation of molecules from indications, or vice versa, will allow for more efficient targeting of diseases and significantly reduce the cost of drug discovery, with the potential to revolutionize the field of drug discovery in the era of generative AI.

Paper Structure

This paper contains 12 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Overview of the methodology of the experiments: drug data is compiled from ChEMBL and DrugBank and used as input for MolT5. Our experiments involved two tasks: drug-to-indication and indication-to-drug. For drug-to-indication, SMILES strings of existing drugs were used as input, producing drug indications as output. Conversely, for indication-to-drug, drug indications of the same set of drugs were the input, resulting in SMILES strings as output. Additionally, we augmented MolT5 with a custom tokenizer in pretraining and evaluated the resulting model on the same tasks.
  • Figure 2: MolT5 and custom tokenizers: the MolT5 tokenizer uses the default English-language tokenization and splits the input text into subwords. The intuition is that SMILES strings are composed of characters typically found in English text, so pretraining on large-scale English corpora may be helpful. The custom tokenizer, in contrast, uses the grammar of SMILES and decomposes the input into grammatically valid components.
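
To make the idea of grammar-aware tokenization in Figure 2 concrete, the following is a minimal sketch of a SMILES tokenizer that splits a string into bracket atoms, element symbols, bonds, branches, and ring-closure digits. It is an illustration only and not the custom tokenizer used in the paper: the regular expression, the name `SMILES_TOKEN_PATTERN`, and the function `tokenize_smiles` are assumptions introduced here, and the pattern covers only a common subset of SMILES syntax.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not taken from the paper). Alternatives are
# ordered so that multi-character tokens win: bracket atoms such as [C@H] or [NH4+]
# first, then two-letter halogens (Br, Cl), then single-letter atoms.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"        # bracket atom, e.g. [C@H], [O-]
    r"|Br|Cl"             # two-letter organic-subset atoms
    r"|[BCNOSPFI]"        # one-letter aliphatic atoms
    r"|[bcnosp]"          # aromatic atoms
    r"|%[0-9]{2}"         # two-digit ring closure, e.g. %12
    r"|[0-9]"             # one-digit ring closure
    r"|[()=#\-+\\/:~@?>*$.])"  # branches, bonds, and other symbols
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into grammar-level tokens.

    Raises ValueError if some characters are not covered by the pattern,
    detected by checking that the tokens reassemble into the input.
    """
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin: each atom, bond, branch, and ring digit becomes its own token.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
    # A bracket atom with chirality stays together as one token.
    print(tokenize_smiles("C[C@H](N)C(=O)O"))
```

A subword tokenizer trained on English text may, by contrast, merge or split characters without regard to chemistry (for example, treating `Cl` as two separate letters or fusing an atom with an adjacent bond), which is the difference the figure highlights.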