QComp: A QSAR-Based Data Completion Framework for Drug Discovery

Bingjia Yang, Yunsie Chung, Archer Y. Yang, Bo Yuan, Xiang Yu

TL;DR

QComp addresses sparse, evolving ADMET data in QSAR‑based drug discovery by modeling missing endpoints as a conditional Gaussian dependent on existing QSAR predictions. It calibrates a linear mapping of the QSAR outputs together with a shared covariance structure, fits both by maximum likelihood, and then completes the data in one shot by taking conditional means. Across ADMET‑750k, public ADMET, peptide, and in vivo/animal datasets, QComp outperforms standard imputers and the base QSAR models, demonstrates robustness to assay correlations, and enables rational experiment design via Gain of Certainty (GOC). The approach improves predictive accuracy and gives practical guidance for prioritizing experiments, with broad applicability to materials and peptide discovery.
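The one-shot completion step can be illustrated with the standard conditional-mean formula for a multivariate Gaussian. The following is a minimal sketch, not the authors' implementation: the function name `complete_row`, the two-endpoint example, and all numbers are assumptions made for illustration, and `mu` stands in for the calibrated QSAR predictions while `sigma` stands in for the shared covariance.

```python
import numpy as np

def complete_row(y, mu, sigma):
    """Fill NaN entries of y with the Gaussian conditional mean.

    y     : (d,) measurements, NaN where an assay was not run
    mu    : (d,) per-endpoint mean (hypothetical calibrated QSAR predictions)
    sigma : (d, d) covariance of the deviations from mu (assumed known here)
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    miss = np.isnan(y)
    obs = ~miss
    out = y.copy()
    if not miss.any():
        return out                      # nothing to complete
    if not obs.any():
        out[miss] = mu[miss]            # no measurements: fall back to the mean
        return out

    # E[y_m | y_o] = mu_m + S_mo S_oo^{-1} (y_o - mu_o)
    resid = y[obs] - mu[obs]
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    out[miss] = mu[miss] + s_mo @ np.linalg.solve(s_oo, resid)
    return out

# Illustrative numbers only: two correlated endpoints, the second unmeasured.
mu = [1.0, 2.0]
sigma = [[1.0, 0.8],
         [0.8, 1.0]]
completed = complete_row([1.5, np.nan], mu, sigma)
print(completed)
```

In this toy case the first endpoint was measured 0.5 above its prediction, so the estimate for the correlated second endpoint is shifted upward from its prediction as well, by the covariance-weighted amount.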

Abstract

In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
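The idea of ranking experiments by how much they reduce uncertainty about a target endpoint can be sketched with Gaussian conditional variances. This is a hedged illustration only: the abstract does not give the exact definition of the paper's certainty measure, so the sketch uses the conditional variance of the target as a stand-in, and the helper names `conditional_variance` and `greedy_order` as well as the covariance values are assumptions.

```python
import numpy as np

def conditional_variance(sigma, target, observed):
    """Variance of endpoint `target` after observing the endpoints in `observed`."""
    sigma = np.asarray(sigma, dtype=float)
    prior = sigma[target, target]
    obs = list(observed)
    if not obs:
        return float(prior)
    s_oo = sigma[np.ix_(obs, obs)]
    s_to = sigma[target, obs]
    # Var[y_t | y_o] = S_tt - S_to S_oo^{-1} S_ot
    return float(prior - s_to @ np.linalg.solve(s_oo, s_to))

def greedy_order(sigma, target, candidates):
    """Greedily pick the assay that most reduces the target's variance at each step."""
    remaining = list(candidates)
    chosen, variances = [], []
    while remaining:
        best = min(remaining,
                   key=lambda c: conditional_variance(sigma, target, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
        variances.append(conditional_variance(sigma, target, chosen))
    return chosen, variances

# Illustrative covariance: endpoint 0 is the target, endpoints 1 and 2 are candidate assays.
sigma = [[1.0, 0.8, 0.5],
         [0.8, 1.0, 0.4],
         [0.5, 0.4, 1.0]]
order, variances = greedy_order(sigma, target=0, candidates=[1, 2])
print(order, variances)
```

Here the assay more strongly correlated with the target is selected first, and each further assay can only lower (never raise) the remaining variance, which is why an accumulated curve along a greedy sequence is monotone.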

Paper Structure

This paper contains 35 sections, 10 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: (a,b) Histograms of the "microsome Cl" assays for dogs and humans. (c) The heatmap of the joint distribution of "microsome Cl, dog" and "microsome Cl, human". (d,e) Histograms of the deviation of "microsome Cl" assays from the QSAR predictions. (f) The heatmap of the joint distribution associated with the quantities in (d) and (e).
  • Figure 2: $r^2$ scores of QComp and the base (random forest) QSAR model on the peptide dataset with random splitting.
  • Figure 3: Gain of certainty accumulated along the optimal (greedy) sequence of in vitro assays.
  • Figure S1: Performance of QComp and the base QSAR model (Chemprop) on the masked ADMET-750k dataset (assay-based temporal splitting).
  • Figure S2: Performance of QComp and the base QSAR model (Chemprop) on the public dataset (random splitting).
  • ...and 4 more figures