QComp: A QSAR-Based Data Completion Framework for Drug Discovery
Bingjia Yang, Yunsie Chung, Archer Y. Yang, Bo Yuan, Xiang Yu
TL;DR
QComp addresses sparse, evolving ADMET (absorption, distribution, metabolism, excretion, and toxicity) data in QSAR-based drug discovery by modeling missing endpoints as a conditional Gaussian that depends on existing QSAR predictions. It fits a linear calibration of the QSAR outputs together with a shared covariance structure by maximum likelihood, which enables one-shot data completion through conditional means. Across ADMET-750k, public ADMET, peptide, and in vivo/animal datasets, QComp outperforms standard imputers and the base QSAR models, remains robust to assay correlations, and supports rational experiment design via Gain of Certainty (GOC). The approach improves predictive accuracy and gives practical guidance for prioritizing experiments, with broad applicability to materials and peptide discovery.
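The one-shot completion through conditional means can be illustrated with the standard Gaussian conditioning formula. The sketch below is not the QComp implementation: the function name `complete_row`, the toy mean vector `mu` (standing in for calibrated QSAR predictions), and the toy covariance `sigma` (standing in for the learned shared covariance) are all assumptions made for illustration.

```python
import numpy as np

def complete_row(y, mu, sigma):
    """Fill NaN entries of y with the Gaussian conditional mean.

    y     : (d,) measured endpoint values, NaN where missing
    mu    : (d,) per-endpoint mean (hypothetical stand-in for calibrated QSAR outputs)
    sigma : (d, d) covariance shared across compounds (hypothetical values)
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    obs = ~np.isnan(y)   # endpoints that were measured
    mis = ~obs           # endpoints to be completed
    out = y.copy()
    if not mis.any():
        return out
    if not obs.any():
        # Nothing measured: fall back to the mean itself.
        out[mis] = mu[mis]
        return out
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(mis, obs)]
    # E[y_mis | y_obs] = mu_mis + S_mo S_oo^{-1} (y_obs - mu_obs)
    out[mis] = mu[mis] + s_mo @ np.linalg.solve(s_oo, y[obs] - mu[obs])
    return out

# Toy example with two correlated endpoints; the second one is missing.
mu = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
completed = complete_row([1.5, np.nan], mu, sigma)
print(completed)
```

In this toy case the measured first endpoint lies 0.5 above its mean, so the missing second endpoint is shifted upward by 0.8 * 0.5, giving 2.0 + 0.4 = 2.4 instead of the uncorrected value 2.0.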
Abstract
In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
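The idea of ordering experiments by how much they reduce statistical uncertainty for a target endpoint can be sketched with Gaussian conditional variances. This section does not define the quantity the authors use, so the sketch below takes the drop in conditional variance as a stand-in; the function names `conditional_variance` and `variance_reduction` and the toy covariance matrix are assumptions for illustration only.

```python
import numpy as np

def conditional_variance(sigma, target, observed):
    """Variance of endpoint `target` given that the endpoints in `observed` are measured."""
    observed = list(observed)
    if not observed:
        return float(sigma[target, target])
    s_oo = sigma[np.ix_(observed, observed)]
    s_to = sigma[target, observed]
    # Var[t | O] = S_tt - S_tO S_OO^{-1} S_Ot
    return float(sigma[target, target] - s_to @ np.linalg.solve(s_oo, s_to))

def variance_reduction(sigma, target, observed, candidate):
    """Drop in the target's conditional variance if `candidate` is measured next."""
    before = conditional_variance(sigma, target, observed)
    after = conditional_variance(sigma, target, list(observed) + [candidate])
    return before - after

# Hypothetical covariance for three endpoints: endpoint 1 is strongly
# correlated with the target (endpoint 0), endpoint 2 only weakly.
sigma = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.0],
                  [0.1, 0.0, 1.0]])

gains = {c: variance_reduction(sigma, target=0, observed=[], candidate=c)
         for c in (1, 2)}
best = max(gains, key=gains.get)
print(gains, best)
```

Under these toy numbers, measuring endpoint 1 first removes far more of the target's variance than measuring endpoint 2, so a greedy ordering would run that assay first.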
