Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs

Miles Wang-Henderson, Benjamin Kaufman, Edward Williams, Ryan Pederson, Matteo Rossi, Owen Howell, Carl Underkoffler, Narbe Mardirossian, John Parkhill

TL;DR

The paper addresses the scalability bottleneck of Batch Bayesian Optimization in molecular design by using Epistemic Neural Networks (ENNs) with pretrained prior networks (Pretrained Epinets) to produce fast joint predictive distributions over binding affinities. By combining structure-informed representations (e.g., COATI embeddings) with prior networks pretrained on synthetic GP-based data, the approach enables efficient parallel acquisition via qPO and EMAX, reducing the number of iterations needed to discover potent compounds. The authors demonstrate gains on two benchmarks: rediscovery of potent EGFR inhibitors on a semi-synthetic benchmark, with up to 5x fewer iterations to reach the same Top-1 pIC50, and screening of a large real-world tArray library, with up to 10x fewer iterations. Batch performance is robust when sampling from the joint ENN predictive distribution. These results suggest a practical, scalable pathway to accelerate large-scale drug discovery, with potential extensions to richer structure-aware representations and multi-property optimization.

Abstract

Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
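
The role of joint samples in parallel acquisition can be illustrated with a minimal NumPy sketch. It assumes that K joint draws over N candidate molecules are already available as a (K, N) array, for example from K epistemic indices of an ENN or K GP sample paths. The function names, the greedy batch construction, and the reading of qPO as the probability that the batch contains the maximizer of the sampled function are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def emax_estimate(samples: np.ndarray, batch: list) -> float:
    """Monte Carlo estimate of E[max_{i in batch} f(x_i)].

    samples has shape (K, N): each row is one joint draw over all N candidates.
    """
    return float(samples[:, batch].max(axis=1).mean())


def qpo_estimate(samples: np.ndarray, batch: list) -> float:
    """Fraction of joint draws whose best candidate lies inside the batch.

    Assumed reading of qPO: probability that the batch contains the optimum.
    """
    best = samples.argmax(axis=1)  # index of the best candidate in each draw
    return float(np.isin(best, batch).mean())


def greedy_emax_batch(samples: np.ndarray, q: int) -> list:
    """Greedily pick q candidates, each time adding the one that raises EMAX most."""
    n_draws, n_candidates = samples.shape
    taken = np.zeros(n_candidates, dtype=bool)
    running_max = np.full(n_draws, -np.inf)  # per-draw max over the chosen set
    chosen = []
    for _ in range(q):
        # EMAX of (chosen + candidate j) for every j, computed in one pass.
        gains = np.maximum(samples, running_max[:, None]).mean(axis=0)
        gains[taken] = -np.inf  # never pick the same candidate twice
        j = int(gains.argmax())
        chosen.append(j)
        taken[j] = True
        running_max = np.maximum(running_max, samples[:, j])
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a joint predictive distribution: correlated Gaussian draws.
    cov = 0.5 * np.eye(6) + 0.5
    draws = rng.multivariate_normal(np.zeros(6), cov, size=2000)
    batch = greedy_emax_batch(draws, q=3)
    print(batch, emax_estimate(draws, batch), qpo_estimate(draws, batch))
```

Because both estimators take a maximum within each joint draw, correlations between candidates matter: two highly correlated designs add little to EMAX beyond either one alone, which is how these criteria hedge between designs. Independent marginal samples would not capture this effect.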

Paper Structure

This paper contains 13 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Simplified overview of an ENN-based architecture that uses a pretrained prior network, with optional inclusion of different fixed latent representations as input. In our experiments, we use COATI, a ligand-only representation [kaufman2024coati], to address a single target.
  • Figure 2: A scalable sampling strategy is necessary to yield convergent estimates for Batch BO. Estimates of the expected maximum (EMAX) pIC50 of a batch of 25 compounds as a function of the number of particles (e.g., sample paths from a Gaussian Process (GP) or epistemic index draws from an Epistemic Neural Network (ENN)), shown under 10 draws. The error in the estimate becomes negligible once the number of particles is proportional to the square of the batch size.
  • Figure 3: Left: Sample paths from the joint predictive distribution of a Pretrained Epinet with a frozen prior network, after training the learnable component on 10 observations. Paths drawn in blue using $K=100$ epistemic particles. True function and training points in grey. We see that the joint predictive distribution is well-calibrated and covers the true function. Right: empirical marginal density of samples from $f_\theta + f_\phi$ in blue, and prior network only $f_\phi$ in purple.
  • Figure 4: Comparison across different input dimensions of negative log-loss on a test subset after training on a small subset of a warped GP sample path. Epinet variants are not strongly distinguished on marginal negative log-loss (NLL). As expected, the Pretrained (PT) Epinet performs consistently well, including on joint negative log-loss evaluated using augmented dyadic sampling [osband2021epistemic].
  • Figure 5: Performance of different Epinet variants and acquisition functions in maximizing pIC50 on the EGFR dataset. Compared to a greedy baseline, using Pretrained Epinets allows us to retrieve the same Top-1 pIC50 in 5x fewer iterations and the same Top-10 mean pIC50 in 7x fewer iterations. Moreover, the final iteration yields more potent molecules than other baselines. An absolute improvement of 0.1 in normalized Top-1 pIC50 retrieved corresponds to an approximately 14x reduction in IC50 concentration. For all curves we plot the mean and standard errors over 20 random seeds. Left: the y-axis shows Top-1 pIC50 retrieved per iteration, normalized by the true maximum pIC50 in the dataset. Right: the y-axis shows the mean pIC50 of the Top-10 highest retrieved compounds, also normalized. See also Table 1.
  • ...and 1 more figure
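
The IC50 conversion quoted in the Figure 5 caption follows from the definition pIC50 = -log10(IC50): an increase of d in pIC50 divides IC50 by 10^d. A short sketch of that arithmetic is given below. The maximum pIC50 of the dataset is not stated in this section, so the value 11.5 used here is a hypothetical placeholder chosen only because it reproduces a factor close to the quoted 14x; the function names are likewise illustrative.

```python
def ic50_fold_reduction(delta_pic50: float) -> float:
    """Fold reduction in IC50 for a given increase in pIC50.

    Uses pIC50 = -log10(IC50), so IC50_old / IC50_new = 10 ** delta_pic50.
    """
    return 10.0 ** delta_pic50


def denormalize(delta_normalized: float, max_pic50: float) -> float:
    """Convert a gain in normalized pIC50 (pIC50 / max pIC50) to pIC50 units."""
    return delta_normalized * max_pic50


# Hypothetical dataset maximum; not taken from the paper.
MAX_PIC50 = 11.5

delta = denormalize(0.1, MAX_PIC50)   # 1.15 pIC50 units
fold = ic50_fold_reduction(delta)     # 10 ** 1.15, roughly 14

if __name__ == "__main__":
    print(f"delta pIC50 = {delta:.2f}, IC50 fold reduction = {fold:.1f}")
```

The point of the conversion is that pIC50 is logarithmic, so small differences between curves in the normalized plots correspond to order-of-magnitude differences in potency.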