Synthetic Data from Diffusion Models Improve Drug Discovery Prediction
Bing Hu, Ashish Saragadam, Anita Layton, Helen Chen
TL;DR
This work tackles data sparsity in drug discovery by introducing Syngand, an end-to-end diffusion GNN that jointly generates ligands and their pharmacokinetic properties. Built on an extension of the DiGress framework, Syngand adds a continuous diffusion process for target properties alongside ligand diffusion, enabling sampling of properties for existing ligands. The authors merge large ligand datasets with three target-property datasets, train on ~1.3 million ligands, and generate synthetic AqSolDB, LD50, and hERG data for 60k ligands, assessing both univariate property quality and downstream regression performance. Initial results show that synthetic data can augment real data to improve regression performance on downstream tasks. However, the generated property values remain clustered near the mean, so univariate data quality needs further optimization before the approach can have broader practical impact. Overall, Syngand offers a scalable approach to addressing cross-dataset data sparsity in drug discovery, enabling researchers to probe questions that span multiple repositories.
Abstract
Artificial intelligence (AI) is increasingly used in every stage of drug development. Continuing breakthroughs in AI-based methods for drug discovery require the creation, improvement, and refinement of drug discovery data. We posit a new data challenge that slows the advancement of drug discovery AI: datasets are typically collected independently of one another, with little overlap, creating data sparsity. Data sparsity makes data curation difficult for researchers whose research questions require values spread across multiple datasets. We propose Syngand, a novel diffusion GNN model capable of generating ligand and pharmacokinetic data end-to-end. We provide and demonstrate a methodology for sampling pharmacokinetic data for existing ligands using Syngand. We show promising initial results on the efficacy of Syngand-generated synthetic target property data in downstream regression tasks on AqSolDB, LD50, and hERG Central. Using our proposed model and methodology, researchers can easily generate synthetic ligand data to help them explore research questions that require data spanning multiple datasets.
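To make the data sparsity problem concrete, the following is a minimal sketch, not the authors' pipeline, of what happens when independently collected property tables are joined on a shared ligand key. The SMILES strings, column names, and property values are made-up placeholders, and pandas is assumed only for illustration.

```python
import pandas as pd

# Toy stand-ins for three independently collected property datasets,
# keyed by SMILES string. All values are invented for illustration only.
solubility = pd.DataFrame(
    {"smiles": ["CCO", "c1ccccc1", "CC(=O)O"], "logS": [1.10, -1.64, 1.22]}
)
toxicity = pd.DataFrame({"smiles": ["CCO", "CCN"], "ld50": [2.0, 2.5]})
herg = pd.DataFrame({"smiles": ["c1ccccc1", "CCCC"], "herg": [0.1, 0.3]})

# An outer join keeps every ligand seen in any dataset; properties that were
# never measured for a given ligand become NaN.
merged = solubility.merge(toxicity, on="smiles", how="outer").merge(
    herg, on="smiles", how="outer"
)

property_cols = ["logS", "ld50", "herg"]

# Fraction of (ligand, property) cells with no measurement.
missing_fraction = merged[property_cols].isna().mean().mean()

# Number of ligands with all three properties measured.
complete_rows = int(merged[property_cols].notna().all(axis=1).sum())

print(merged)
print(f"missing fraction: {missing_fraction:.2f}, complete rows: {complete_rows}")
```

In this toy case no ligand has all three properties, so a question that needs solubility, toxicity, and hERG values together cannot be answered from the merged table alone. Sampling the missing properties for existing ligands, as the abstract describes, is aimed at exactly these empty cells.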
