APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
Aryan Pedawi, Jordi Silvestre-Ryan, Bradley Worley, Darren J Hsu, Kushal S Shah, Elias Stehle, Jingrong Zhang, Izhar Wallach
TL;DR
The paper tackles the infeasibility of exhaustive virtual screening of ultra-large CSLs by introducing APEX, which combines a neural surrogate with a hierarchical, factorized representation of CSLs to enable near-exhaustive top-$k$ retrieval under practical GPU budgets. By training a multitask surrogate and a ReactionFactorizer that reconstructs embeddings from reaction-level structure, APEX can evaluate predicted scores for the entire library from synthon-level contributions, substantially reducing compute and memory while supporting constraint-aware queries. Evaluations on a benchmark CSL (more than 10 million annotated compounds) show high recall of the ground-truth top-$j$ compounds at retrieval budgets such as $k=100{,}000$, and comparisons with Thompson sampling indicate robust, often superior performance, especially at low budgets. On CSLs with more than $10^{10}$ compounds, runtimes are in the tens of seconds on a single GPU, making approximate-but-exhaustive screening practical. This lowers the barrier to exhaustive virtual screening, enabling rapid, interactive exploration of vast chemical spaces and faster hypothesis testing in drug discovery.
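To make the idea of scoring a whole library from synthon-level contributions concrete, here is a minimal sketch. It assumes a two-component reaction and a purely additive surrogate, where the predicted score of a product is the sum of one contribution per synthon; the actual ReactionFactorizer and surrogate may combine synthons differently. The function name `topk_factorized` and the arguments `u`, `v`, `p`, `q`, and `max_prop` are hypothetical and chosen only for illustration.

```python
import numpy as np

def topk_factorized(u, v, k, p=None, q=None, max_prop=None):
    """Top-k products of a two-component reaction under an ASSUMED additive model.

    u[i], v[j]: per-synthon score contributions (hypothetical surrogate outputs).
    p[i], q[j]: per-synthon contributions to a constrained property (optional).
    Returns (scores, i_idx, j_idx) sorted by descending score.
    """
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    # Score every product at once: only len(u) + len(v) surrogate outputs are
    # needed to cover all len(u) * len(v) products.
    scores = u[:, None] + v[None, :]
    if max_prop is not None:
        # Constraint-aware query: mask products whose predicted property is too large.
        prop = np.asarray(p, dtype=float)[:, None] + np.asarray(q, dtype=float)[None, :]
        scores = np.where(prop <= max_prop, scores, -np.inf)
    flat = scores.ravel()
    k = min(k, flat.size)
    part = np.argpartition(flat, -k)[-k:]      # unordered indices of the k largest
    order = part[np.argsort(-flat[part])]      # sort those k by descending score
    order = order[np.isfinite(flat[order])]    # drop masked (infeasible) products
    i_idx, j_idx = np.unravel_index(order, scores.shape)
    return flat[order], i_idx, j_idx

# Toy library: 3 x 2 synthons give 6 products.
top_scores, top_i, top_j = topk_factorized([0.0, 2.0, 1.0], [0.5, 3.0], k=3)
```

Changing `max_prop` or `k` reuses the same per-synthon arrays, which is one way to read the amortization claim: the surrogate is evaluated once per synthon, and each new query only repeats the cheap combine-and-select step. A real library with more than $10^{10}$ products would need this step tiled or streamed on the GPU rather than materialized as one dense array.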
Abstract
Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have become an important enabler of drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to constraints, all within a limited computational budget. For current library sizes (tens of billions of compounds) and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, potentially leaving many high-scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-$k$ sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprising more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicochemical properties computed with RDKit, such that, for any objective and set of constraints, the ground-truth top-$k$ compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance in both retrieval accuracy and runtime compared to alternative methods.
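The benchmark comparison described above amounts to a recall measurement: of the ground-truth top-$j$ compounds that satisfy the constraints, what fraction appears in the set an algorithm retrieved? A minimal sketch follows, assuming higher scores are better (docking scores, where lower is usually better, would be negated first). The function name `topj_recall` and its arguments are hypothetical, not part of the benchmark itself.

```python
import numpy as np

def topj_recall(true_scores, retrieved_ids, j, feasible=None):
    """Fraction of the ground-truth top-j compounds present in retrieved_ids.

    true_scores:   ground-truth objective per compound (ASSUMED higher is better).
    retrieved_ids: compound indices returned by a screening algorithm.
    feasible:      optional boolean mask of compounds satisfying the constraints.
    """
    true_scores = np.asarray(true_scores, dtype=float)
    if feasible is not None:
        # Infeasible compounds can never be in the ground-truth top set.
        true_scores = np.where(np.asarray(feasible, dtype=bool), true_scores, -np.inf)
    j = min(j, int(np.isfinite(true_scores).sum()))
    if j == 0:
        return 0.0
    top_true = np.argsort(-true_scores, kind="stable")[:j]
    hits = np.intersect1d(top_true, np.asarray(retrieved_ids))
    return hits.size / j

# Toy example: 5 compounds; the retrieval contains one of the true top-2.
recall = topj_recall([5.0, 1.0, 4.0, 3.0, 2.0], retrieved_ids=[0, 3, 4], j=2)
```

Because every compound in the benchmark is annotated, `true_scores` and `feasible` can be built for any objective and constraint set, so the same function applies unchanged to any retrieval method being compared.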
