Synthetic Data from Diffusion Models Improve Drug Discovery Prediction

Bing Hu, Ashish Saragadam, Anita Layton, Helen Chen

TL;DR

This work tackles data sparsity in drug discovery by introducing Syngand, an end-to-end diffusion GNN that jointly generates ligands and their pharmacokinetic properties. Built on an extension of the DiGress framework, Syngand adds a continuous diffusion process for target properties alongside ligand diffusion, enabling sampling of properties for existing ligands. The authors merge large ligand datasets with three target-property datasets, train on ~1.3 million ligands, and generate synthetic AqSolDB, LD50, and hERG data for 60k ligands, assessing both univariate property quality and downstream regression performance. Initial results show that synthetic data can augment real data to improve regression outcomes on downstream tasks, though the generated property values remain clustered near the mean and require further optimization for broader practical impact. Overall, Syngand offers a scalable approach to addressing cross-dataset data sparsity in drug discovery, enabling researchers to probe questions that span multiple repositories.

Abstract

Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are typically collected independently of one another, with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers looking to answer key research questions that require values spanning multiple datasets. We propose a novel diffusion GNN model, Syngand, capable of generating ligand and pharmacokinetic data end-to-end. We provide a methodology for sampling pharmacokinetic data for existing ligands using our Syngand model. We show promising initial results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG Central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.
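
The data sparsity described above can be illustrated with a small sketch. The ligand keys (SMILES strings) and property values below are made up for illustration and are not taken from AqSolDB, LD50, or hERG Central; the helper names `merge_properties` and `sparsity` are hypothetical. The sketch outer-joins three independently collected property tables and measures how many cells of the merged table are empty.

```python
# Toy illustration of cross-dataset sparsity. All SMILES keys and values are
# invented for this example; they are NOT entries from the real datasets.
aqsoldb = {"CCO": 1.10, "c1ccccc1": -1.64, "CC(=O)O": 1.22}
ld50 = {"CCO": 2.05, "CCN": 2.30}
herg = {"c1ccccc1": 0.0, "CCCC": 1.0}


def merge_properties(datasets):
    """Outer-join property tables on the ligand key; missing cells become None."""
    ligands = sorted(set().union(*(d.keys() for d in datasets.values())))
    return {
        lig: {name: table.get(lig) for name, table in datasets.items()}
        for lig in ligands
    }


def sparsity(table):
    """Fraction of (ligand, property) cells with no measured value."""
    cells = [value for row in table.values() for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


merged = merge_properties({"AqSolDB": aqsoldb, "LD50": ld50, "hERG": herg})

# Ligands for which every property is measured (none in this toy example).
complete = [lig for lig, row in merged.items() if None not in row.values()]

print(f"ligands: {len(merged)}, sparsity: {sparsity(merged):.2f}")
print(f"ligands with all three properties: {complete}")
```

In this toy case no ligand has all three properties measured, so a question that needs solubility, toxicity, and hERG values together cannot be answered from the merged real data alone. That is the gap the synthetic target properties are meant to fill.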

Paper Structure

This paper contains 11 sections, 5 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the methodology for generating target properties for existing ligands. Existing ligands with incomplete initial target properties are repeatedly partially denoised, and the generated synthetic target properties are output.
  • Figure 2: Pre-training loss for the ligand graph $G = (X,E)$ and the target properties $y$, as well as the relaxed validity metric.
  • Figure 3: Distribution of generated synthetic target properties, AqSolDB (left), LD50 (middle), and hERG (right) for selected ligands. Synthetic distributions are in blue while real distributions are in yellow. Log scale is used for the y-axis.
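
The procedure summarized in the Figure 1 caption (repeatedly partially denoising the incomplete target properties of an existing ligand) might look roughly like the sketch below. This is a toy under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for a trained network, the noise level `t_start`, the number of `rounds`, and the interpolation-style reverse update are all illustrative choices, and re-imposing measured values after each round is an assumption that the caption does not confirm.

```python
import math
import random


def toy_denoiser(y_noisy, t, ligand):
    """Hypothetical stand-in for a trained model: predicts clean properties.

    It simply shrinks values toward 0 (the mean of normalized properties);
    a real model would condition on the ligand graph and the time step t.
    """
    return [0.5 * value for value in y_noisy]


def partial_denoise(y_init, ligand, denoiser, rounds=5, t_start=0.5,
                    steps=10, seed=0):
    """Fill missing (None) properties by repeated partial noising/denoising."""
    rng = random.Random(seed)
    known = [value is not None for value in y_init]
    # Assumption: start missing entries at 0, the normalized mean.
    y = [value if is_known else 0.0 for value, is_known in zip(y_init, known)]

    for _ in range(rounds):
        # Forward step: noise only part of the way (variance-preserving mix).
        keep, noise = math.sqrt(1.0 - t_start), math.sqrt(t_start)
        y = [keep * value + noise * rng.gauss(0.0, 1.0) for value in y]

        # Reverse steps: move toward the denoiser's prediction. This is a
        # simplified interpolation, not an actual DDPM or DiGress sampler.
        for step in range(steps, 0, -1):
            t = t_start * step / steps
            y_hat = denoiser(y, t, ligand)
            weight = 1.0 / step
            y = [(1.0 - weight) * value + weight * pred
                 for value, pred in zip(y, y_hat)]

        # Assumption: measured values are re-imposed after every round.
        y = [orig if is_known else value
             for orig, is_known, value in zip(y_init, known, y)]
    return y


# One ligand with a single measured property (index 1) and two missing ones.
result = partial_denoise([None, 1.5, None], ligand="CCO", denoiser=toy_denoiser)
print(result)
```

Only the missing entries are synthesized; the ligand itself stays fixed throughout, which matches the caption's description of sampling properties for existing ligands rather than generating new ones.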