TxGemma: Efficient and Agentic LLMs for Therapeutics

Eric Wang, Samuel Schmidgall, Paul F. Jaeger, Fan Zhang, Rory Pilgrim, Yossi Matias, Joelle Barral, David Fleet, Shekoofeh Azizi

TL;DR

TxGemma presents a suite of efficient generalist LLMs for therapeutics that jointly predict, explain, and reason over diverse therapeutic data. By fine-tuning Gemma-2 on Therapeutics Data Commons with therapeutic instruction-tuning, TxGemma achieves competitive or superior performance across 66 therapeutic tasks, outperforming state-of-the-art generalist and specialist models in many cases. The introduction of TxGemma-Chat enables mechanistic explanations of predictions, while Agentic-Tx demonstrates multi-step, tool-assisted workflows that attain state-of-the-art results on challenging chemistry and biology benchmarks. The work also emphasizes data efficiency, open-model release, and the integration of external knowledge through a modular agentic system, offering a practical path toward faster, more transparent therapeutic discovery. Overall, TxGemma and Agentic-Tx illustrate a meaningful shift toward open, interactive, and scalable AI-assisted therapeutic development, with potential impact from early hypothesis generation to prospective trial planning.

Abstract

Therapeutic development is a costly and high-risk endeavor that is often plagued by high failure rates. To address this, we introduce TxGemma, a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline. The suite includes 2B, 9B, and 27B parameter models, fine-tuned from Gemma-2 on a comprehensive dataset of small molecules, proteins, nucleic acids, diseases, and cell lines. Across 66 therapeutic development tasks, TxGemma achieved superior or comparable performance to the state-of-the-art generalist model on 64 (superior on 45), and against state-of-the-art specialist models on 50 (superior on 26). Fine-tuning TxGemma models on therapeutic downstream tasks, such as clinical trial adverse event prediction, requires less training data than fine-tuning base LLMs, making TxGemma suitable for data-limited applications. Beyond these predictive capabilities, TxGemma features conversational models that bridge the gap between general LLMs and specialized property predictors. These allow scientists to interact in natural language, provide mechanistic reasoning for predictions based on molecular structure, and engage in scientific discussions. Building on this, we further introduce Agentic-Tx, a generalist therapeutic agentic system powered by Gemini 2.5 that reasons, acts, manages diverse workflows, and acquires external domain knowledge. Agentic-Tx surpasses prior leading models on the Humanity's Last Exam benchmark (Chemistry & Biology) with 52.3% relative improvement over o3-mini (high) and 26.7% over o3-mini (high) on GPQA (Chemistry) and excels with improvements of 6.3% (ChemBench-Preference) and 2.4% (ChemBench-Mini) over o3-mini (high).

Paper Structure

This paper contains 25 sections, 2 equations, 25 figures, 21 tables.

Figures (25)

  • Figure 1: Overview of TxGemma. (top) All TxGemma variants are trained on diverse data sources of the Therapeutics Data Commons (TDC). TxGemma-Predict comes in three size variants (2B, 9B, and 27B) and is trained for high-performance predictions on a broad set of therapeutic development tasks. TxGemma-Chat features two variants (9B and 27B) and is trained on a combination of TDC data with general Gemma-2 instruction tuning data to retain conversational and reasoning capabilities. Agentic-Tx, a therapeutics-focused agentic system powered by Gemini 2.5, has access to 18 tools including TxGemma-Predict and TxGemma-Chat to collect external knowledge and manages complex tasks in either autonomous or interactive settings. (bottom-right) Absolute performance of Agentic-Tx compared to best-in-class models on three complex therapeutic-related reasoning benchmarks. The state-of-the-art (SOTA) values are obtained from mirza2024largeOpenAIReasoningLLMs and details are listed in \ref{tab:tx-agent-perf}. Dashed lines: L=lowest, M=mean, H=highest human scores. (bottom-left) Relative performance changes of TxGemma-Predict compared to the SOTA generalist model for each task type. The assignment of the 66 evaluated TDC tasks to task types is shown in Tables \ref{tab-sup:binary_dataset_sizes} and \ref{tab-sup:regression_generation_dataset_sizes}. The bottom bar chart shows a summary of results where TxGemma-Predict outperforms or nearly matches SOTA (light blue), and outperforms SOTA (darker blue).
  • Figure 2: Example workflow of agentic planning and execution with Agentic-Tx. Agentic-Tx uses the ReAct framework yao2022react to interleave thought with tool-usage. When a user poses a query, Agentic-Tx checks whether the query structure matches any defined tool trigger. If so, the query is routed to the corresponding tool, which (i) parses the request, (ii) invokes specialized logic, and (iii) returns a structured answer to the agent. The agent then composes a user-facing response. This adaptive tool-use mechanism is especially helpful for tasks that require external references, chemical data transformations, or precise chemical information, areas where self-contained LLMs often hallucinate. In the displayed example, Agentic-Tx uses two tools to solve a complex therapeutic task: TxGemma-Chat and the clinical toxicity prediction tool based on TxGemma-Predict.
  • Figure 3: Comparison of TxGemma-Predict's performance with therapeutic generalist models. (top) Relative performance improvement of TxGemma-27B-Predict in comparison to Tx-LLM S. TxGemma-27B-Predict outperforms Tx-LLM S on 62 tasks and underperforms on only 4. (bottom) Relative performance improvement of TxGemma-27B-Predict in comparison to Tx-LLM M. TxGemma-27B-Predict outperforms Tx-LLM M on 45 out of 66 tasks, while underperforming on 21. When aggregating performance over tasks, we observe a net improvement of TxGemma-27B-Predict over Tx-LLM models, with a statistically significant difference (p=0.003, Wilcoxon signed-rank test). These results establish TxGemma-27B-Predict as a competitive and functionally enhanced alternative at practical model sizes. Values for each task can be found in \ref{tab:binary_results_chat_and_txllm} and \ref{tab:regression_generation_chat_and_txllm}.
  • Figure 4: Comparison of TxGemma's performance with best-in-class specialist models. TxGemma-27B-Predict is evaluated on each task in TDC and compared to the corresponding best-in-class competitor. The panels depict different metrics used to evaluate the tasks. Tasks are colored by their feature types, including one or a combination of SMILES, amino acid, nucleotide, and text, as indicated in the legend. Marker sizes illustrate the number of data points in the task on a log scale. The larger shaded area in blue indicates where TxGemma outperforms best-in-class models, while the narrower light blue shaded area indicates where TxGemma performs near the best-in-class model (defined as within 10%). MAE and MSE values are log-transformed since the magnitudes of these values depend on the units of outputs. Generation accuracy is the fraction of correct SMILES strings in the USPTO generation task. Values for each task can also be found in \ref{tab:binary_results} and \ref{tab:regression_generation_results}.
  • Figure 5: TxGemma-Chat bridges the gap between property predictors and general LLMs. Each point represents a therapeutic task in the TDC. The figure depicts relative predictive performance changes of TxGemma-Chat in comparison to TxGemma-Predict (top) and Gemma-2 (bottom) for 9B variants (left) and 27B variants (right). As expected, TxGemma-27B-Predict outperforms TxGemma-27B-Chat on therapeutic tasks, with TxGemma-27B-Chat showing a 10.69% median relative performance reduction. However, TxGemma-27B-Chat exceeds the Gemma-2-27B baseline by 29.67% on TDC therapeutic tasks. Similarly, TxGemma-9B-Chat's performance is 10.32% lower than TxGemma-9B-Predict's. Values for each task can be found in \ref{tab:binary_results_chat_and_txllm} and \ref{tab:regression_generation_chat_and_txllm}.
  • ...and 20 more figures
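
The tool-routing step described in the Figure 2 caption (check the query against tool triggers, route it to the matching tool, return a structured answer, compose a user-facing response) can be illustrated with a minimal Python sketch. This is not the Agentic-Tx implementation: the `Tool` class, the keyword trigger, and the `stub_toxicity` function are hypothetical placeholders, the stub does not call TxGemma or any real model, and the sketch covers a single routing step rather than a full ReAct loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Tool:
    """Hypothetical tool record; the names here are illustrative only."""
    name: str
    trigger: Callable[[str], bool]   # decides whether a query matches this tool
    run: Callable[[str], Dict]       # parses the request and returns a structured answer


def route(query: str, tools: List[Tool]) -> Optional[Dict]:
    """Return the structured answer of the first tool whose trigger matches, else None."""
    for tool in tools:
        if tool.trigger(query):
            answer = {"tool": tool.name}
            answer.update(tool.run(query))
            return answer
    return None


def respond(query: str, tools: List[Tool]) -> str:
    """Compose a user-facing response from the tool observation, if there is one."""
    observation = route(query, tools)
    if observation is None:
        return "No tool matched; answering from the model alone."
    return "[{}] {}".format(observation["tool"], observation["result"])


def stub_toxicity(query: str) -> Dict:
    """Placeholder standing in for a toxicity predictor; it makes no prediction."""
    smiles = query.split()[-1]  # assumption: the molecule is the last token of the query
    return {"result": "toxicity prediction requested for " + smiles}


# Assumed trigger: a simple keyword check. A real system would match far more robustly.
TOOLS = [Tool("clinical_toxicity", lambda q: "toxic" in q.lower(), stub_toxicity)]

if __name__ == "__main__":
    print(respond("Is this molecule toxic? CCO", TOOLS))
    print(respond("Summarize the abstract.", TOOLS))
```

Keeping the trigger and the tool logic as separate callables means new tools can be registered without touching the routing function, which is one plausible reading of the modular design the caption describes.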