Tx-LLM: A Large Language Model for Therapeutics

Juan Manuel Zambrano Chaves, Eric Wang, Tao Tu, Eeshit Dhaval Vaishnav, Byron Lee, S. Sara Mahdavi, Christopher Semturs, David Fleet, Vivek Natarajan, Shekoofeh Azizi

TL;DR

Tx-LLM is a generalist LLM fine-tuned from PaLM-2 to encode therapeutic knowledge across small molecules, proteins, nucleic acids, cell lines, and diseases. It is trained on the TxT collection of 709 datasets spanning 66 tasks, formats each task as an instruction-context-question-answer prompt, and interleaves molecular representations with free text to support classification, regression, and generation. The model is competitive with SOTA on 43 of 66 tasks and exceeds SOTA on 22, with the largest gains on tasks that combine SMILES strings with text, and it shows evidence of positive transfer across drug types. Ablation studies highlight the importance of model size, domain finetuning, and context. These results position Tx-LLM as a promising step toward an end-to-end therapeutic development assistant, with current limitations that warrant careful validation: the model lacks natural-language instruction tuning, and data contamination must be considered.
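The prompt format mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' template: the function name `build_txt_prompt`, the field labels ("Instructions:", "Context:", "Question:", "Answer:"), and the example wording are all assumptions, and the exemplar is illustrative rather than taken from any TDC dataset.

```python
from typing import Sequence, Tuple


def build_txt_prompt(
    instructions: str,
    context: str,
    question: str,
    exemplars: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble an instruction-context-question-answer prompt.

    The field labels are hypothetical; the paper's exact template may differ.
    Exemplars are (question, answer) pairs placed before the final question
    to support in-context learning.
    """
    parts = [f"Instructions: {instructions}", f"Context: {context}"]
    for exemplar_question, exemplar_answer in exemplars:
        parts.append(f"Question: {exemplar_question}\nAnswer: {exemplar_answer}")
    # The final answer slot is left empty for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)


# Illustrative usage: free text interleaved with SMILES strings.
prompt = build_txt_prompt(
    instructions="Answer the following question about drug properties.",
    context="The blood-brain barrier limits which drugs reach the brain.",
    question=(
        "Does the following molecule cross the blood-brain barrier? "
        "Drug SMILES: CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
    ),
    exemplars=[
        (
            "Does the following molecule cross the blood-brain barrier? "
            "Drug SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
            "Yes",
        )
    ],
)
print(prompt)
```

Because the molecule is passed as an ordinary string, the same function would serve amino acid or nucleotide sequences without modification.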

Abstract

Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free text, allowing it to predict a broad range of associated properties. It achieves performance competitive with the state of the art (SOTA) on 43 out of 66 tasks and exceeds SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery and development pipeline.

Paper Structure

This paper contains 14 sections, 8 figures, 17 tables.

Figures (8)

  • Figure 1: Overview of Tx-LLM. (top) Datasets from the Therapeutic Data Commons (TDC) are used to construct the Therapeutics Instruction Tuning (TxT) collection. The original tabular datasets contain a variety of drug types including small molecules, macromolecules such as proteins and nucleic acids, cells, and genes. The tasks encompass a broad range of areas relevant to drug discovery and development such as predicting targets, evaluating efficacy and safety, and predicting ease of manufacturing. TxT interleaves free-text instructions with string representations of molecules, such as SMILES strings for small molecules or amino acid sequences for proteins. TxT is used to prompt and finetune Tx-LLM to solve classification, regression, or generation tasks. (bottom) Example of a TxT prompt for predicting drug synergy. The prompt is composed of Instructions, Context, and a Question using information from the corresponding TDC dataset and/or literature search and may also contain exemplars to aid in-context learning.
  • Figure 2: Tx-LLM may be effective for end-to-end therapeutic development. Tx-LLM is a single model that can be queried for multiple steps of the therapeutic development process, covering tasks from early-stage target discovery to late-stage clinical trial approval. We list example tasks associated with each stage of the therapeutic development pipeline, example datasets in TDC that correspond to these tasks, and example prompts that can be used to query Tx-LLM. For illustration, the example prompts are geared towards discovering new small molecules against targets associated with type 2 diabetes, and the datasets associated with the example prompts are shown in bold.
  • Figure 3: Comparison of Tx-LLM's performance with SOTA. Tx-LLM is evaluated on each dataset in TDC, and comparison with SOTA for different metrics is illustrated in panels. Datasets are colored by their feature types indicated in the legend, and marker sizes illustrate the number of data points in the task on a log scale. The larger shaded area in green indicates where Tx-LLM outperforms SOTA, while the narrower orange shaded area indicates where Tx-LLM is near SOTA (defined as within 10%). MAE and MSE values are log-transformed because the magnitudes of these values depend on the units of the outputs. Generation accuracy is the fraction of correct SMILES strings in the USPTO generation task.
  • Figure 4: Tx-LLM shows evidence of positive transfer across datasets with diverse drug types. Performance of Tx-LLM (S) finetuned and evaluated on small molecule datasets. "All datasets" indicates a Tx-LLM (S) model finetuned on all TDC datasets, and "Molecule datasets" indicates a Tx-LLM (S) model finetuned only on datasets containing molecules (datasets involving other drug types such as proteins or nucleic acids are not included in training). Datasets are colored by their feature types indicated in the legend, and marker sizes illustrate the number of data points in the task on a log scale. The larger shaded area in green indicates where "All datasets" is better than "Molecule datasets" (showing evidence of positive transfer), while the narrower orange shaded area indicates where the performance of "Molecule datasets" is near the performance of "All datasets" (defined as within 10%). MAE and MSE values are log-transformed because the magnitudes of these values depend on the units of the outputs. Generation accuracy is the fraction of correct SMILES strings in the USPTO generation dataset.
  • Figure A.1: Distribution of TDC dataset sizes, aggregated over train, validation, and test sets. For datasets containing multiple subtasks, such as ToxCast which contains data for more than 600 different assays, the dataset size is calculated by summing over the sizes for each subtask.
  • ...and 3 more figures
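
The captions for Figures 3 and 4 define "near" as within 10% of the reference score. A minimal Python sketch of that comparison follows. The function name `compare_to_sota`, the use of a relative margin on the raw metric value, and the handling of lower-is-better metrics such as MAE and MSE are assumptions made for illustration; the captions do not specify how the paper computes the margin.

```python
def compare_to_sota(
    score: float,
    sota: float,
    higher_is_better: bool = True,
    tolerance: float = 0.10,
) -> str:
    """Classify a score as 'exceeds', 'near', or 'below' a reference score.

    Assumption: 'within 10%' is a relative margin of the reference value,
    applied in the direction that favors the reference.
    """
    margin = tolerance * abs(sota)
    if higher_is_better:
        # Metrics such as AUROC or accuracy: larger values are better.
        if score > sota:
            return "exceeds"
        if score >= sota - margin:
            return "near"
    else:
        # Error metrics such as MAE or MSE: smaller values are better.
        if score < sota:
            return "exceeds"
        if score <= sota + margin:
            return "near"
    return "below"


# Hypothetical scores, chosen only to exercise each branch.
print(compare_to_sota(0.92, 0.90))
print(compare_to_sota(0.85, 0.90))
print(compare_to_sota(0.60, 0.50, higher_is_better=False))
```

The same function covers the Figure 4 comparison by passing the "All datasets" score as `score` and the "Molecule datasets" score as `sota`.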