Preferential Multi-Objective Bayesian Optimization for Drug Discovery

Tai Dang, Long-Hung Pham, Sang T. Truong, Ari Glenn, Wendy Nguyen, Edward A. Pham, Jeffrey S. Glenn, Sanmi Koyejo, Thang Luong

TL;DR

This work tackles the bottleneck of hit selection in virtual screening by introducing CheapVS, a chemist-guided preferential multi-objective Bayesian optimization framework that incorporates expert pairwise preferences into the search for multi-property drug candidates. It couples a lightweight diffusion-based binding-affinity measurement with a Gaussian-process-based preference model, enabling efficient exploration of large ligand libraries under a fixed computational budget. The key contributions are the learning of latent utility from expert preferences, a data-augmented diffusion docking approach (EDM-S) for scalable affinity estimation, and an end-to-end screening pipeline that significantly improves recovery of known drugs for EGFR and DRD2 compared with affinity-only baselines. The results demonstrate substantial efficiency gains in hit identification, with practical implications for accelerating drug discovery while balancing multiple pharmacokinetic and safety-related objectives.
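
To make the idea of learning a latent utility from pairwise preferences concrete, here is a minimal sketch. It is not the paper's method: it swaps the Gaussian-process preference model for a linear utility fitted with a Bradley-Terry (logistic) likelihood, purely for brevity. The property vectors, the weights `true_w`, and the function name `fit_linear_utility` are all hypothetical.

```python
import numpy as np

def fit_linear_utility(X, pairs, lr=0.5, steps=2000, l2=1e-3):
    """Fit weights w so that sigmoid(w . (x_win - x_lose)) is high for each pair.

    X:     (n, d) array of ligand property vectors (hypothetical features).
    pairs: list of (winner_index, loser_index) comparisons from an expert.
    """
    X = np.asarray(X, dtype=float)
    diffs = np.array([X[i] - X[j] for i, j in pairs])  # winner minus loser
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diffs @ w))           # P(winner is preferred)
        # Gradient of the mean log-likelihood, with a small L2 penalty.
        grad = diffs.T @ (1.0 - p) / len(pairs) - l2 * w
        w += lr * grad                                 # gradient ascent step
    return w

# Toy data: a made-up "expert" who ranks ligands by a hidden linear utility.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -0.5, 0.25])                   # assumption, not from the paper
pairs = []
for _ in range(200):
    i, j = rng.choice(50, size=2, replace=False)
    pairs.append((i, j) if X[i] @ true_w > X[j] @ true_w else (j, i))

w = fit_linear_utility(X, pairs)
cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
accuracy = np.mean([(X[i] - X[j]) @ w > 0 for i, j in pairs])
print(f"cosine similarity to hidden weights: {cosine:.3f}")
print(f"fraction of comparisons reproduced:  {accuracy:.3f}")
```

Only the direction of `w` is identifiable from comparisons, which is why the check uses cosine similarity rather than the raw weights.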

Abstract

Despite decades of advances in automated ligand screening, large-scale drug discovery remains resource-intensive and still requires post-processing hit selection, a step in which chemists manually pick a few promising molecules based on their chemical intuition. This step is a major bottleneck in virtual screening for drug discovery, because experts must repeatedly balance complex trade-offs among drug properties across a vast pool of candidates. To improve the efficiency and reliability of this process, we propose a novel human-centered framework named CheapVS that lets chemists guide ligand selection by stating their preferences over trade-offs between drug properties through pairwise comparisons. Our framework combines preferential multi-objective Bayesian optimization with a docking model for measuring binding affinity, capturing human chemical intuition to improve hit identification. On a library of 100K chemical candidates targeting EGFR and DRD2, CheapVS outperforms state-of-the-art screening methods in identifying drugs within a limited computational budget. Notably, our method can recover up to 16/37 EGFR and 37/58 DRD2 known drugs while screening only 6% of the library, showcasing its potential to significantly advance drug discovery.
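
The headline numbers in the abstract can be put in context with a little arithmetic. The sketch below converts the reported counts into recall and compares them with uniform random screening. The enrichment-factor framing, and the assumption that random screening recovers drugs in proportion to the fraction screened, are illustrative additions rather than figures from the paper.

```python
library_size = 100_000
fraction_screened = 0.06
screened = round(library_size * fraction_screened)  # ligands actually evaluated

# (known drugs recovered, known drugs in the library), best case per the abstract
reported = {"EGFR": (16, 37), "DRD2": (37, 58)}

recall = {}
enrichment = {}
for target, (found, total) in reported.items():
    recall[target] = found / total
    # Assumption: uniform random screening recovers about `fraction_screened`
    # of the known drugs, so enrichment = recall / fraction screened.
    enrichment[target] = recall[target] / fraction_screened
    expected_random = total * fraction_screened
    print(f"{target}: recall {recall[target]:.1%}, "
          f"random baseline ~{expected_random:.1f} drugs, "
          f"enrichment ~{enrichment[target]:.1f}x")

print(f"ligands screened: {screened}")
```

Because the abstract reports these counts as "up to" values, the resulting recall and enrichment should be read as upper bounds.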

Paper Structure

This paper contains 31 sections, 3 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: Chemist-guided Active Preferential Virtual Screening performance in identifying EGFR and DRD2 drugs. The search is conducted on a 100K-ligand library, with at most 6% of the library screened. The plot compares different methods for structure-based binding affinity measurement (Vina, EDM-S, Chai-1) and objective types. The y-axis shows the percentage of known approved drugs identified, while the x-axis shows the number of ligands screened. For every affinity method, multi-objective optimization (circles) outperforms affinity-only selection (triangles) and random screening (gray line). Error bars indicate 1 standard deviation.
  • Figure 2: Overview of Chemist-guided Active Preferential Virtual Screening (CheapVS). Ligands from a large library are selected using an acquisition function and evaluated through structure-based affinity models. Chemists provide preference rankings, which inform a utility model to refine the selection process. The screened library iteratively improves, prioritizing ligands with desirable properties. Yellow-colored ligands represent found drug compounds, while purple ligands indicate screened compounds.
  • Figure 3: Accuracy over cumulative FLOPs on EGFR under the same screening settings. Vina achieves the highest accuracy with the fewest FLOPs, EDM-S is in between, and Chai-1 uses the most FLOPs with the lowest accuracy.
  • Figure 4: Scatter plots comparing EDM Affinity with Vina Affinity (left) and RMSD (right). A moderate correlation is observed between EDM and Vina affinities (r = 0.52), while no meaningful correlation exists between RMSD and EDM Affinity (r = 0.18).
  • Figure 5: Predictive utility scores after BO on expert preference elicitation. Heatmaps show utility for two objectives (others fixed at the mean), while box plots compare the mean scores of drugs vs. non-drugs with 95% CI bars, showing that the algorithm captures domain knowledge and balances competing properties.
  • ...and 13 more figures
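
The selection loop outlined in the Figure 2 caption (pick ligands with an acquisition function, evaluate them, update a utility model, repeat) can be sketched as follows. This is a simplified stand-in, not the CheapVS implementation: it uses Bayesian linear regression with an upper-confidence-bound rule in place of the paper's Gaussian-process preference model and acquisition function, and a noisy scalar `oracle` in place of docking plus chemist comparisons. The name `screen`, the features, and the weights are all hypothetical.

```python
import numpy as np

def screen(features, oracle, budget, batch=10, beta=1.0, noise=0.1, seed=0):
    """Select `budget` library indices in batches using a UCB rule.

    features: (n, d) ligand descriptors (hypothetical).
    oracle:   callable index -> noisy utility; stands in for the docking
              model plus chemist feedback.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    chosen = rng.choice(n, size=batch, replace=False).tolist()  # random warm start
    ys = [oracle(i) for i in chosen]
    while len(chosen) < budget:
        X = features[chosen]
        # Posterior of Bayesian linear regression with a unit Gaussian prior.
        cov = np.linalg.inv(X.T @ X / noise**2 + np.eye(d))
        mean_w = cov @ X.T @ np.array(ys) / noise**2
        mu = features @ mean_w                                   # predicted utility
        sd = np.sqrt(np.einsum("ij,jk,ik->i", features, cov, features))
        ucb = mu + beta * sd                                     # optimism bonus
        ucb[chosen] = -np.inf                                    # never re-screen
        k = min(batch, budget - len(chosen))
        new = np.argsort(ucb)[-k:].tolist()
        chosen.extend(new)
        ys.extend(oracle(i) for i in new)
    return chosen

# Toy library: 1000 ligands, of which the 20 with highest hidden utility are "drugs".
rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -0.5, 0.25])            # assumption, not from the paper
utility = features @ true_w
drugs = set(np.argsort(utility)[-20:].tolist())
oracle = lambda i: utility[i] + 0.1 * rng.normal()

picked = screen(features, oracle, budget=60)    # 6% of the toy library
found = len(drugs & set(picked))
print(f"screened {len(picked)} ligands, recovered {found} of {len(drugs)} toy drugs")
```

The toy problem is far easier than real screening because the hidden utility is exactly linear, so the recovery rate here says nothing about the paper's results; the sketch only shows the shape of the loop.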