DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

Tianyu Liu, Sihan Jiang, Fan Zhang, Kunyang Sun, Teresa Head-Gordon, Hongyu Zhao

Abstract

Large language models (LLMs) are in the ascendancy in drug discovery research, offering unprecedented opportunities to reshape the field by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations relative to traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance in generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations justifying LLM predictions, thereby testing the chemical and biological reasoning capabilities of LLMs and promoting their broader use at all stages of drug discovery.

Paper Structure

This paper contains 11 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overview of DrugPlayground. We prepare datasets with molecule-text-paired information as well as other multimodal sources. Our benchmark contains two different branches, one is for evaluating LLM-generated textual content, the other is for evaluating LLM-generated embeddings. Our evaluation and quantitative analysis not only includes numerical metrics, but also feedback from chemists. We also provide recommendations to choose the most suitable setting for different tasks.
  • Figure 2: Model Performance in Terms of Text Generation. (a) Five LLMs' BLEU scores under standard prompts across temperatures. (b) The best performance achieved by each LLM at its optimized temperature across different prompts: standard, chain-of-thought (CoT), and meta-cognition (Meta) prompts. (c) The five evaluation metrics on their original 0-1 scales, with combinations ordered along the y-axis in descending order of their average normalized metric scores; the Normalized Total is on a 0-5 scale. The total variance is the sum of across-drug variances, each computed as the variance of the mean metric scores across different drugs' descriptions. We consider three prompting approaches: standard, domain-specific meta prompting (Meta), and chain-of-thought (CoT).
  • Figure 3: Model Performance in Terms of Embedding for Drug Representation. (a) The workflow of evaluating the embeddings generated from the best descriptions, obtained using GPT-4o with a Meta prompt at temperature 0.0. (b) Average cosine similarity of generated embeddings from five LLMs across temperatures. Each bar shows the average cosine similarity between the embeddings of generated content and the embeddings of the ground truth for a given model at the corresponding temperature.
  • Figure 4: Benchmarking results of drug synergy prediction. (a) AUROC and accuracy across all benchmark methods for the classification task. (b) PCC and R² across all benchmark methods for the regression task. (c) Exploration of shared capacities across different LLM-generated embeddings from the perspectives of biology and chemistry.
  • Figure 5: Benchmarking results for LLM embeddings in drug-protein interaction prediction. (a)-(c) show accuracy based on the prediction results with different LLM-generated drug embeddings across three datasets. (b)-(d) show drug-protein pairs with a detected interaction (top) and with no interaction (bottom) across the three datasets. We also provide the textual descriptions of the drugs from GPT-4o.
  • ...and 1 more figure
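The embedding evaluation described in Figure 3 — averaging per-drug cosine similarities between generated and ground-truth embeddings — can be sketched as below. This is a minimal illustration, not code from DrugPlayGround; the function and variable names are hypothetical.

```python
import numpy as np

def avg_cosine_similarity(generated, reference):
    """Mean cosine similarity between paired rows of two embedding
    matrices of shape (n_drugs, dim). Illustrative names only."""
    g = np.asarray(generated, dtype=float)
    r = np.asarray(reference, dtype=float)
    # row-wise cosine similarity: dot product over product of norms
    sims = np.sum(g * r, axis=1) / (
        np.linalg.norm(g, axis=1) * np.linalg.norm(r, axis=1)
    )
    return float(np.mean(sims))

# toy check: identical embeddings yield a similarity of exactly 1.0
emb = np.array([[1.0, 0.0], [0.0, 2.0]])
print(avg_cosine_similarity(emb, emb))  # → 1.0
```

One score per model-temperature setting, as in each bar of Figure 3(b), would come from one such call over that setting's embedding matrix.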