Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development

Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, Marinka Zitnik

TL;DR

Therapeutics Data Commons (TDC) presents the first unified platform that integrates AI-ready datasets, learning tasks, evaluation tools, molecule generation oracles, and leaderboards across the full therapeutics pipeline. By organizing data into a three-tier design (single-instance, multi-instance, generation) and providing 66 datasets across 22 tasks, TDC enables rigorous, cross-task benchmarking and rapid method development while addressing real-world distribution shifts and heterogeneity. The work demonstrates systematic evaluation of domain-specific methods and highlights gaps between current ML approaches and practical therapeutics challenges, underscoring the need for robust, generalizable models. As an open-science initiative, TDC aims to accelerate algorithmic advances and the translation of ML to biomedical discovery and clinical implementation.

Abstract

Therapeutics machine learning is an emerging field with incredible opportunities for innovation and impact. However, advancement in this field requires formulation of meaningful learning tasks and careful curation of datasets. Here, we introduce Therapeutics Data Commons (TDC), the first unifying platform to systematically access and evaluate machine learning across the entire range of therapeutics. To date, TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and types of meaningful data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards. All resources are integrated and accessible via an open Python library. We carry out extensive experiments on selected datasets, demonstrating that even the strongest algorithms fall short of solving key therapeutics challenges, including real dataset distributional shifts, multi-scale modeling of heterogeneous data, and robust generalization to novel data points. We envision that TDC can facilitate algorithmic and scientific advances and considerably accelerate machine-learning model development, validation and transition into biomedical and clinical implementation. TDC is an open-science initiative available at https://tdcommons.ai.

Paper Structure

This paper contains 66 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of Therapeutics Data Commons (TDC). TDC is a platform with AI-ready datasets and learning tasks for therapeutics, spanning the discovery and development of safe and effective medicines. TDC provides an ecosystem of tools and data functions, including strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All resources are integrated and accessible via a Python package. TDC also provides community resources with extensive documentation and tutorials, and leaderboards for systematic model comparison and evaluation.
  • Figure 2: Therapeutics Machine Learning. Therapeutics machine learning offers incredible opportunities for expansion, innovation, and impact. Datasets and benchmarks in TDC provide a systematic model development and evaluation framework. We envision that TDC can considerably accelerate development, validation, and transition of machine learning into production and clinical implementation.
  • Figure 3: Tiered design of Therapeutics Data Commons. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a collection of datasets. In the first tier, we have three broad machine learning problems: (a) single-instance prediction is concerned with predicting properties of individual entities; (b) multi-instance prediction is concerned with predicting properties of groups of entities; and (c) generation is concerned with the automatic generation of new entities. For each problem, we have a set of learning tasks. For example, the ADME learning task aims to predict experimental properties of individual compounds; it falls under single-instance prediction. Lastly, for each task, we have a collection of datasets. For example, TDC.Caco2_Wang is a dataset under the ADME learning task, which, in turn, is under the single-instance prediction problem. This unique three-tier structure is, to the best of our knowledge, the first attempt at systematically organizing therapeutics ML.
  • Figure 4: Heatmap visualization of the performance of domain generalization methods across each domain in the TDC DTI-DG benchmark using TDC.BindingDB. We observe a significant gap between the in-distribution and out-of-distribution performance and highlight the demand for algorithmic innovation.
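
The three-tier organization described in the Figure 3 caption (problem, then learning task, then dataset) can be pictured as a nested mapping. The sketch below is a minimal illustration only, not the TDC library itself: the `TIERS` registry and the `locate` helper are hypothetical names introduced here, and only the placement of Caco2_Wang under ADME under single-instance prediction comes from the caption. Placing BindingDB under a DTI task in the multi-instance tier is an assumption that the text above does not state.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-written registry: problem -> learning task -> datasets.
# This is an illustration of the tiered design, not TDC's actual data model.
TIERS: Dict[str, Dict[str, List[str]]] = {
    # Stated in the Figure 3 caption.
    "single-instance prediction": {"ADME": ["Caco2_Wang"]},
    # Assumption: the tier for the DTI task is not given in the text above.
    "multi-instance prediction": {"DTI": ["BindingDB"]},
    # No generation tasks are named in the text above, so this tier is left empty.
    "generation": {},
}


def locate(dataset: str) -> Tuple[str, str]:
    """Return the (problem, task) pair that a dataset belongs to."""
    for problem, tasks in TIERS.items():
        for task, datasets in tasks.items():
            if dataset in datasets:
                return problem, task
    raise KeyError(f"unknown dataset: {dataset}")


if __name__ == "__main__":
    problem, task = locate("Caco2_Wang")
    print(f"Caco2_Wang -> task {task} -> problem {problem}")
```

Walking the mapping from the outside in reproduces the path given in the caption: a dataset resolves to exactly one task, and that task to exactly one problem. A real client would query the TDC Python package rather than a hard-coded dictionary.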