Understanding active learning of molecular docking and its applications

Jeonghyeon Kim, Juno Nam, Seongok Ryu

TL;DR

This work analyzes how active learning for molecular docking leverages surrogate GNNs to predict docking scores using only 2D ligand information. It benchmarks six receptors to understand when 2D-only surrogates can reliably identify top docking compounds, revealing that models tend to memorize common structural patterns in high-scoring samples and that predictive accuracy is highest for top-ranked compounds. Greedy and UCB acquisitions maximize recovery of top docking scores, but induce biased sampling that limits rank-order reliability beyond the top tier; uncertainty-driven acquisition yields better overall prediction accuracy. Despite these biases, surrogate-guided screening remains effective for reducing docking costs and identifying actives in large libraries such as EnamineREAL, with practical guidance on when to use 2D surrogates, how to handle receptor pocket characteristics, and potential two-stage or human-in-the-loop strategies.
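
The three acquisition strategies named above can be illustrated with a minimal sketch. It assumes the AutoDock Vina convention that lower (more negative) docking scores are better, and that the surrogate returns a predicted mean and an uncertainty estimate per ligand. The function names, the `beta` weight, and the sign conventions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def acquisition_scores(mean, std, strategy="greedy", beta=2.0):
    """Return a desirability value per ligand (higher = dock sooner).

    mean: predicted docking scores (assumed: lower is better).
    std:  predictive uncertainty from the surrogate (assumed available).
    beta: hypothetical exploration weight for the UCB variant.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    if strategy == "greedy":
        # Pure exploitation: prefer the best predicted score.
        return -mean
    if strategy == "ucb":
        # Optimistic bound: good predicted score plus an uncertainty bonus.
        return -mean + beta * std
    if strategy == "uncertainty":
        # Pure exploration: prefer ligands the surrogate is least sure about.
        return std
    raise ValueError(f"unknown strategy: {strategy}")


def select_batch(mean, std, batch_size, strategy="greedy", beta=2.0):
    """Indices of the `batch_size` ligands to dock in the next round."""
    scores = acquisition_scores(mean, std, strategy, beta)
    order = np.argsort(-scores, kind="stable")  # descending desirability
    return order[:batch_size].tolist()


# Toy pool of four ligands (values are made up for illustration).
pred_mean = [-9.0, -7.0, -8.0, -6.0]
pred_std = [0.1, 0.2, 0.1, 2.0]
greedy_pick = select_batch(pred_mean, pred_std, 2, "greedy")
ucb_pick = select_batch(pred_mean, pred_std, 2, "ucb")
uncertainty_pick = select_batch(pred_mean, pred_std, 2, "uncertainty")
```

In this toy pool, greedy selection concentrates on the ligands already predicted to score well, whereas uncertainty-driven selection spends its budget on poorly characterized ligands. This is consistent with the trade-off described above between top-score recovery and overall prediction accuracy.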

Abstract

With the advancing capabilities of computational methodologies and resources, ultra-large-scale virtual screening via molecular docking has emerged as a prominent strategy for in silico hit discovery. Given the exhaustive nature of ultra-large-scale virtual screening, active learning methodologies have garnered attention as a means to mitigate computational cost through iterative small-scale docking and machine learning model training. While the efficacy of active learning methodologies has been empirically validated in the literature, it remains to be investigated how surrogate models can predict docking scores without considering three-dimensional structural features, such as receptor conformation and binding poses. In this paper, we therefore investigate how active learning methodologies predict docking scores using only 2D structures, and under what circumstances they work particularly well, through benchmark studies encompassing six receptor targets. Our findings suggest that surrogate models tend to memorize structural patterns prevalent in the compounds with high docking scores obtained during acquisition steps. Despite this tendency, surrogate models demonstrate utility in virtual screening, as exemplified by the identification of actives from the DUD-E dataset and of compounds with high docking scores from the EnamineReal library, a significantly larger set than the initial screening pool. Our comprehensive analysis underscores the reliability and potential applicability of active learning methodologies in virtual screening campaigns.

Paper Structure

This paper contains 17 sections, 11 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Our workflow for active learning of molecular docking. We used the EnamineHTS library as a pool of ligands and AutoDock Vina (Trott and Olson, 2010) as a docking simulation tool.
  • Figure 2: The root-mean-square error (RMSE) between docking scores and prediction scores plotted against the number of acquisitions, with varying acquisition strategies and for six receptor targets.
  • Figure 3: The coefficient of determination $R^{2}$ between docking scores and prediction scores plotted against the number of acquisitions.
  • Figure 4: The recovery rate of top-1000 compounds plotted against the number of acquisitions.
  • Figure 5: Change in the $i$-th interval hit ratio (Eq. \ref{eq:HR}) as the interval index $i$ increases. Predictions from the surrogate GNN models are accurate only for top-docking-scored compounds.
  • ...and 8 more figures
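
The quantities plotted in Figures 2 to 4 (RMSE, $R^{2}$, and top-$k$ recovery rate) can be computed as in the sketch below. It uses the standard textbook definitions and assumes that lower docking scores are better, so the "top" compounds are those with the lowest scores. The function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between docking and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def top_k_recovery(docking_scores, acquired_indices, k):
    """Fraction of the k best-scoring pool compounds that were acquired.

    Assumes lower docking score = better, so the top-k set is taken
    from the k lowest values in `docking_scores`.
    """
    scores = np.asarray(docking_scores, dtype=float)
    top_k = set(np.argsort(scores, kind="stable")[:k].tolist())
    acquired = {int(i) for i in acquired_indices}
    return len(top_k & acquired) / k


# Toy example with made-up scores for a four-compound pool.
true_scores = [-9.0, -8.0, -7.0, -6.0]
pred_scores = [-8.0, -9.0, -6.0, -7.0]
err = rmse(true_scores, pred_scores)
fit = r_squared(true_scores, pred_scores)
recovery = top_k_recovery(true_scores, acquired_indices=[0, 3], k=2)
```

Note that the recovery rate depends only on which compounds were acquired, not on how accurately the rest of the pool is predicted. A strategy can therefore score well on recovery while still showing a high RMSE over the full pool.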