DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
Youwei Liang, Ruiyi Zhang, Li Zhang, Pengtao Xie
TL;DR
DrugChat aims to bring ChatGPT-like interactivity to drug molecule graphs by connecting a graph neural network (GNN) to a large language model (LLM) through an adaptor that translates graph representations into a form the LLM can accept as part of its prompt. The system is trained on instruction-tuning data from ChEMBL and PubChem (10,834 compounds and 143,517 question-answer pairs). The GNN and LLM weights are kept fixed, and the adaptor is trained to minimize the negative log-likelihood of the ground-truth answers. The goal is to accelerate drug discovery by supporting multi-turn questions about molecular graphs, insights into structure-activity relationships (SAR), and suggestions for lead optimization. One limitation is language hallucination, which may be mitigated by higher-quality data, filtering, and possibly reinforcement learning with human feedback.
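The training objective mentioned above (negative log-likelihood of the ground-truth answer) can be illustrated with a small sketch. This is not the authors' implementation: the function names `softmax` and `answer_nll`, the toy vocabulary of four tokens, and the hand-written logits are all assumptions made for illustration. In a real system the logits would come from the frozen LLM conditioned on the adapted graph representation and the question, and only the adaptor parameters would receive gradient updates.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_nll(step_logits, answer_token_ids):
    """Sum of -log p(token) over the ground-truth answer tokens.

    step_logits: one list of vocabulary scores per answer position
                 (hypothetical stand-in for the LLM's output).
    answer_token_ids: the ground-truth token id at each position.
    """
    if len(step_logits) != len(answer_token_ids):
        raise ValueError("need exactly one logit vector per answer token")
    nll = 0.0
    for logits, token_id in zip(step_logits, answer_token_ids):
        nll -= math.log(softmax(logits)[token_id])
    return nll


# Toy example: uniform scores over a 4-token vocabulary, 2 answer tokens.
uniform_loss = answer_nll([[0.0] * 4, [0.0] * 4], [1, 3])

# A model that scores the correct token highly incurs a lower loss.
confident_loss = answer_nll([[10.0, 0.0, 0.0, 0.0]], [0])
uncertain_loss = answer_nll([[0.0, 0.0, 0.0, 0.0]], [0])
```

With uniform scores each answer token contributes log(4) to the loss, so the loss shrinks only as the model assigns more probability to the ground-truth tokens.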
Abstract
A ChatGPT-like system for drug compounds could be a game-changer in pharmaceutical research, accelerating drug discovery, enhancing our understanding of structure-activity relationships, guiding lead optimization, aiding drug repurposing, reducing the failure rate, and streamlining clinical trials. In this work, we make an initial attempt towards enabling ChatGPT-like capabilities on drug molecule graphs by developing a prototype system, DrugChat. DrugChat works in a similar way to ChatGPT. Users upload a compound molecule graph and ask various questions about this compound. DrugChat answers these questions in a multi-turn, interactive manner. The DrugChat system consists of a graph neural network (GNN), a large language model (LLM), and an adaptor. The GNN takes a compound molecule graph as input and learns a representation for this graph. The adaptor transforms the graph representation produced by the GNN into another representation that is acceptable to the LLM. The LLM takes the compound representation transformed by the adaptor and users' questions about this compound as inputs and generates answers. All these components are trained end-to-end. To train DrugChat, we collected instruction-tuning datasets that contain 10,834 drug compounds and 143,517 question-answer pairs. The code and data are available at https://github.com/UCSD-AI4H/drugchat
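The data flow described above (GNN output, then adaptor, then LLM input) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released DrugChat code: it assumes the adaptor is a single linear map, that the graph is summarized as one vector, and that the adapted vector is prepended to the question's token embeddings. The names `make_linear_adaptor`, `apply_adaptor`, and `build_llm_input`, and the dimensions used, are hypothetical.

```python
import random


def make_linear_adaptor(gnn_dim, llm_dim, seed=0):
    """Create (weights, bias) for a linear map from GNN space to LLM space.

    Assumption: the adaptor is one linear layer with small random weights.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(gnn_dim)]
               for _ in range(llm_dim)]
    bias = [0.0] * llm_dim
    return weights, bias


def apply_adaptor(adaptor, graph_vec):
    """Project a graph representation into the LLM embedding dimension."""
    weights, bias = adaptor
    return [sum(w * x for w, x in zip(row, graph_vec)) + b
            for row, b in zip(weights, bias)]


def build_llm_input(graph_vec, question_embeddings, adaptor):
    """Prepend the adapted graph vector to the question token embeddings."""
    graph_token = apply_adaptor(adaptor, graph_vec)
    return [graph_token] + list(question_embeddings)


# Toy dimensions; real GNN and LLM embedding sizes are much larger.
GNN_DIM, LLM_DIM = 4, 6
adaptor = make_linear_adaptor(GNN_DIM, LLM_DIM)

graph_vec = [0.5, -1.0, 0.25, 2.0]              # stand-in for the GNN output
question = [[0.0] * LLM_DIM for _ in range(3)]  # stand-in for 3 token embeddings

llm_input = build_llm_input(graph_vec, question, adaptor)
```

Here `llm_input` is a sequence of four vectors of length `LLM_DIM`: one for the compound and three for the question. The point of the sketch is that the adaptor only has to match dimensions between the two models, so the GNN and the LLM can be developed independently.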
