APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries

Aryan Pedawi, Jordi Silvestre-Ryan, Bradley Worley, Darren J Hsu, Kushal S Shah, Elias Stehle, Jingrong Zhang, Izhar Wallach

TL;DR

The paper addresses the infeasibility of exhaustive virtual screening on ultra-large CSLs by introducing APEX, which combines a neural surrogate with a hierarchical, factorized representation of CSLs to enable near-exhaustive top-$k$ retrieval under practical GPU budgets. By training a multitask surrogate and a ReactionFactorizer that reconstructs the surrogate's embeddings from reaction-level structure, APEX can evaluate predicted scores for the entire library from synthon-level contributions, substantially reducing compute and memory while supporting constraint-aware queries. Evaluations on a benchmark CSL (>10M annotated compounds) show high recall of ground-truth top-$j$ compounds at budgets such as $k=100{,}000$, and comparisons with Thompson sampling indicate robust, often superior performance, especially at low budgets. Runtimes on CSLs with >$10^{10}$ compounds are on the order of tens of seconds on a single GPU. This lowers the barrier to near-exhaustive virtual screening, enabling rapid, interactive exploration of vast chemical spaces in drug discovery.

Abstract

Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes -- numbering in the tens of billions of compounds -- and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high-scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-$k$ sets. To demonstrate APEX's capabilities, we develop a benchmark CSL comprising more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicochemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-$k$ compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX's consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.
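The evaluation idea in the abstract (compare a method's retrieved set against the ground truth top set under the same constraints) can be illustrated with a small NumPy sketch. Everything here is an illustrative assumption rather than the authors' code: the function names `constrained_top_k` and `recall_at`, the synthetic scores, and the single boolean feasibility mask standing in for a set of property constraints. Lower scores are treated as better, as with docking scores.

```python
import numpy as np

def constrained_top_k(scores, feasible, k):
    """Return indices of the k lowest scores among feasible compounds, best first."""
    # Infeasible compounds get +inf so they can never enter the top set.
    masked = np.where(feasible, scores, np.inf)
    k = min(k, int(feasible.sum()))
    if k < 1:
        return np.array([], dtype=int)
    # argpartition finds the k smallest without fully sorting the library.
    idx = np.argpartition(masked, k - 1)[:k]
    return idx[np.argsort(masked[idx])]

def recall_at(truth_idx, retrieved_idx):
    """Fraction of the (non-empty) ground-truth set that appears in the retrieved set."""
    truth = set(np.asarray(truth_idx).tolist())
    retrieved = set(np.asarray(retrieved_idx).tolist())
    return len(truth & retrieved) / len(truth)

# Hypothetical data: true scores, a noisy surrogate, and one constraint mask.
rng = np.random.default_rng(0)
true_scores = rng.normal(size=10_000)
surrogate_scores = true_scores + rng.normal(scale=0.5, size=10_000)
feasible = rng.random(10_000) < 0.3

truth_top_j = constrained_top_k(true_scores, feasible, k=50)       # ground truth top-j
retrieved_top_k = constrained_top_k(surrogate_scores, feasible, k=500)  # surrogate top-k
print("recall of top-j within top-k:", recall_at(truth_top_j, retrieved_top_k))
```

Using a retrieval budget $k$ larger than $j$ mirrors the setup in which the approximate top-$k$ set is later rescored with the exact objective, so what matters is whether the true best compounds fall inside the retrieved set.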

Paper Structure

This paper contains 20 sections, 11 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Beyond commercially available make-on-demand CSLs, it is relatively straightforward to design an ultra-large CSL for virtual screening using publicly available libraries of enumerated compounds like ZINC22 and cheminformatics tools like RDKit. These designs are incredibly valuable for virtual screening due to their ability to densely cover large swaths of relevant chemical space.
  • Figure 2: The APEX (approximate-but-exhaustive) search protocol, enabling rapid, on-the-fly virtual screening of ultra-large CSLs. APEX consists of three main steps. Step 1: Train the surrogate. Given an enumerated and labeled dataset, a multi-task neural network is trained to predict molecular properties of interest, like docking scores. Step 2: Train the factorizer. Given a CSL, the reaction factorizer is trained to reconstruct embeddings of the surrogate model from reaction and R-group assignment pairs. The factorizer induces an approximation of surrogate properties that is amenable to substantial amortization in executing top-$k$ retrieval on ultra-large CSLs with respect to those properties. Step 3: Run approximate-but-exhaustive search. Given a search query (e.g., minimize docking score on target of interest subject to drug-likeness constraints), factorized surrogate properties are calculated for all compounds in the CSL and the top-$k$ are retrieved based on the objective subject to constraints. An efficient GPU implementation allows for running a top-$k$ search with $k=\text{1 million}$ on a 10 billion compound CSL in approximately 30 seconds with a single T4 GPU.
  • Figure 3: (A) Percent of compounds in the ground truth top-$j$ set retrieved by the APEX top-$k=\text{100,000}$ set from the 12M compound CSL. A random baseline will achieve a recall below 0.01. (B) Constraint satisfaction rates for the APEX retrievals. Black line denotes the base fraction of satisfying compounds in the library for each set of constraints.
  • Figure 4: Docking scores for the APEX top-$k=\text{100,000}$ on the 10B library are enriched with respect to the background distribution and with respect to the top-$k$ set from the smaller 12M library. Lower scores are better (i.e., indicate better interaction between ligand and receptor).
  • Figure 5: Example molecules from the 10B compound CSL.
  • ...and 3 more figures
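Step 3 in the Figure 2 caption relies on factorized surrogate properties being cheap to evaluate for every compound. A minimal sketch of why factorization helps is given below, under assumptions that are not confirmed by the text above: a two-component reaction, a compound embedding approximated as the sum of its two synthon embeddings, and a linear readout (`w`, `bias`) from embedding to property. Under those assumptions each product's score is the sum of two per-synthon contributions, so a library of $n_A \times n_B$ products needs only $n_A + n_B$ readout evaluations. The function name `factorized_top_k` and all data are hypothetical.

```python
import numpy as np

def factorized_top_k(emb_a, emb_b, w, bias, k):
    """Top-k (lowest score) products of a two-component reaction.

    Assumes score(a, b) = w . (emb_a[a] + emb_b[b]) + bias, so the score
    splits into one contribution per synthon.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    contrib_a = emb_a @ w                      # shape (n_a,): one value per A synthon
    contrib_b = emb_b @ w                      # shape (n_b,): one value per B synthon
    # Broadcasting forms all n_a * n_b product scores from n_a + n_b contributions.
    grid = contrib_a[:, None] + contrib_b[None, :] + bias
    flat = grid.ravel()
    k = min(k, flat.size)
    idx = np.argpartition(flat, k - 1)[:k]     # k smallest, unordered
    idx = idx[np.argsort(flat[idx])]           # order best first
    ia, ib = np.unravel_index(idx, grid.shape) # map back to synthon pairs
    return ia, ib, flat[idx]

# Hypothetical synthon embeddings and readout weights.
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(200, 16))
emb_b = rng.normal(size=(300, 16))
w = rng.normal(size=16)
ia, ib, top_scores = factorized_top_k(emb_a, emb_b, w, bias=0.1, k=10)
for a, b, s in zip(ia, ib, top_scores):
    print(f"synthon pair ({a}, {b}): score {s:.3f}")
```

This sketch materializes the full score grid in memory, which is only practical for small reactions. A search over billions of compounds would presumably process the grid in tiles and keep a running top-$k$, a detail this example does not attempt to reproduce.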