DrugAssist: A Large Language Model for Molecule Optimization

Geyan Ye; Xibao Cai; Houtim Lai; Xing Wang; Junhong Huang; Longyue Wang; Wei Liu; Xiangxiang Zeng

TL;DR

The paper tackles molecule optimization in drug discovery by introducing DrugAssist, an interactive, human-in-the-loop LLM framework fine-tuned for multi-turn dialogue with domain experts. It pairs this with MolOpt-Instructions, a large instruction-based dataset built from matched molecular pairs to enable realistic, range-aware optimization across multiple properties. The approach demonstrates state-of-the-art performance on single- and multi-property tasks, plus notable transferability and iterative refinement via user feedback. By releasing both the dataset and code, the authors aim to catalyze broader adoption of interactive LLMs in drug discovery and real-world optimization scenarios.

Abstract

Recently, the impressive performance of large language models (LLMs) on a wide range of tasks has attracted an increasing number of attempts to apply LLMs in drug discovery. However, molecule optimization, a critical task in the drug discovery pipeline, is currently an area that has seen little involvement from LLMs. Most existing approaches focus solely on capturing the underlying patterns in chemical structures provided by the data, without taking advantage of expert feedback. These non-interactive approaches overlook the fact that drug discovery requires the integration of expert experience and iterative refinement. To address this gap, we propose DrugAssist, an interactive molecule optimization model that performs optimization through human-machine dialogue by leveraging the strong interactivity and generalizability of LLMs. DrugAssist achieves leading results in both single- and multiple-property optimization, while also showing strong potential for transferability and iterative optimization. In addition, we publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning language models on molecule optimization tasks. We have made our code and data publicly available at https://github.com/blazerye/DrugAssist, which we hope will pave the way for future research on applying LLMs to drug discovery.

Paper Structure (24 sections, 9 figures, 6 tables)

This paper contains 24 sections, 9 figures, 6 tables.

Introduction
Related Work
Traditional approaches in molecule optimization
Sequence-based
Graph-based
LLMs in biomedical domain
Methods
Construction of MolOpt-Instructions Dataset
Overview and Statistics
Data Construction
Analysis and Discussion
Instruction Tuning
Experiments
Experimental Setup
Models
...and 9 more sections

Figures (9)

Figure 1: Illustration of our proposed DrugAssist model framework, which focuses on optimizing molecules through human-machine dialogue.
Figure 2: The data construction workflow of MolOpt-Instructions. First, we randomly picked one million molecules from the ZINC dataset. Then, we used mmpdb (Dalke et al., 2018) to generate similar pairs from these molecules and selected the candidate pairs that met our requirements. Once we had identified suitable molecular pairs, we calculated their property values using iDrug. After obtaining these pairs and their corresponding property values, we asked ChatGPT to suggest a variety of instructions for the molecule optimization tasks and manually refined them.
Figure 3: Distribution of structural properties of molecules within MolOpt-Instructions, illustrating the structural diversity of the molecules.
Figure 4: Distribution of ADMET-related properties of molecules within MolOpt-Instructions. Currently, MolOpt-Instructions covers six properties: solubility, BBBP (Blood-Brain Barrier Penetration), hERG (Human Ether-a-go-go-Related Gene) inhibition, QED (Quantitative Estimate of Drug-likeness), and the numbers of hydrogen bond donors and acceptors. The distributions demonstrate the diversity of biochemical properties of the molecules in our dataset.
Figure 5: Illustration of the multi-task learning strategy. We apply instruction tuning to a direct combination of different data sources (general knowledge and molecule optimization), which effectively mitigates catastrophic forgetting during the fine-tuning stage.
...and 4 more figures
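
The last step of the Figure 2 workflow, turning a molecular pair with computed property values into an instruction record, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the `MolPair` class, the `TEMPLATES` wording, the `make_record` helper, and the property values are all hypothetical, and the real dataset uses instructions suggested by ChatGPT and refined manually.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MolPair:
    """A pair of similar molecules with one property value each (hypothetical schema)."""
    source: str          # SMILES of the molecule to be optimized
    target: str          # SMILES of the structurally similar, modified molecule
    source_value: float  # property value of the source molecule
    target_value: float  # property value of the target molecule


# Hypothetical instruction templates; the wording is an assumption.
TEMPLATES = {
    "increase": "Modify the molecule {smiles} to increase its {prop} by at least {threshold}.",
    "decrease": "Modify the molecule {smiles} to decrease its {prop} by at least {threshold}.",
}


def make_record(pair: MolPair, prop: str, threshold: float) -> Optional[dict]:
    """Build one instruction record, or return None if the property change is too small."""
    delta = pair.target_value - pair.source_value
    if abs(delta) < threshold:
        return None  # the pair does not change the property enough to be a useful example
    direction = "increase" if delta > 0 else "decrease"
    instruction = TEMPLATES[direction].format(
        smiles=pair.source, prop=prop, threshold=threshold
    )
    return {"instruction": instruction, "output": pair.target}


# Toy pairs with made-up property values, for illustration only.
improved = make_record(MolPair("CCO", "CCN", 0.40, 0.55), "QED", 0.1)
skipped = make_record(MolPair("CCO", "CCN", 0.40, 0.45), "QED", 0.1)
```

Filtering on a minimum property change is one simple way to make the instructions range-aware; the thresholds and selection criteria actually used for MolOpt-Instructions are not given in this summary.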
