Understanding active learning of molecular docking and its applications
Jeonghyeon Kim, Juno Nam, Seongok Ryu
TL;DR
This work analyzes how active learning for molecular docking uses surrogate GNNs to predict docking scores from 2D ligand information alone. It benchmarks six receptors to understand when 2D-only surrogates can reliably identify top docking compounds, and finds that models tend to memorize structural patterns common among high-scoring samples and that predictive accuracy is highest for top-ranked compounds. Greedy and upper confidence bound (UCB) acquisition maximize recovery of top docking scores but induce biased sampling that limits rank-order reliability beyond the top tier; uncertainty-driven acquisition yields better overall prediction accuracy. Despite these biases, surrogate-guided screening remains effective for reducing docking costs and identifying actives in large libraries such as Enamine REAL. The work also offers practical guidance on when to use 2D surrogates, how to handle receptor pocket characteristics, and potential two-stage or human-in-the-loop strategies.
Abstract
With the advancing capabilities of computational methodologies and resources, ultra-large-scale virtual screening via molecular docking has emerged as a prominent strategy for in silico hit discovery. Because exhaustive docking of such libraries is costly, active learning methodologies have attracted attention as a means of reducing computational cost through iterative small-scale docking and machine learning model training. While the efficacy of active learning has been empirically validated in the literature, it remains unclear how surrogate models can predict docking scores without considering three-dimensional structural features such as receptor conformation and binding poses. In this paper, we therefore investigate how active learning methodologies effectively predict docking scores using only 2D structures, and under what circumstances they work particularly well, through benchmark studies encompassing six receptor targets. Our findings suggest that surrogate models tend to memorize structural patterns prevalent in the compounds with high docking scores obtained during acquisition steps. Despite this tendency, surrogate models demonstrate utility in virtual screening, as exemplified by the identification of actives from the DUD-E dataset and of compounds with high docking scores from the Enamine REAL library, a significantly larger set than the initial screening pool. Our comprehensive analysis underscores the reliability and potential applicability of active learning methodologies in virtual screening campaigns.
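To make the acquisition step of such an active learning loop concrete, the following is a minimal sketch of how a batch of ligands might be chosen for the next docking round from surrogate predictions. It is an illustration under stated assumptions, not the procedure used in this work: the function name `acquire`, the weight `beta`, and the way the per-ligand uncertainty `std` is obtained are all hypothetical, and scores are assumed to be oriented so that higher is better (for example, a negated docking energy).

```python
def acquire(mean, std, k, strategy="greedy", beta=2.0):
    """Return indices of the k undocked ligands to dock next.

    mean: predicted scores from the surrogate (assumed: higher is better).
    std:  predictive uncertainty per ligand (assumed to be available,
          e.g. from an ensemble; the source does not specify this here).
    beta: hypothetical exploration weight used only by the UCB strategy.
    """
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")

    if strategy == "greedy":
        # Pure exploitation: rank by predicted score alone.
        utility = list(mean)
    elif strategy == "ucb":
        # Upper confidence bound: predicted score plus an uncertainty bonus.
        utility = [m + beta * s for m, s in zip(mean, std)]
    elif strategy == "uncertainty":
        # Pure exploration: rank by predictive uncertainty alone.
        utility = list(std)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Stable sort in descending utility; ties keep their original order.
    ranked = sorted(range(len(utility)), key=lambda i: -utility[i])
    return ranked[:k]


# Toy predictions for four undocked ligands.
mean = [1.0, 3.0, 2.0, 0.5]
std = [0.1, 0.1, 1.0, 1.2]

greedy_batch = acquire(mean, std, k=2, strategy="greedy")
ucb_batch = acquire(mean, std, k=2, strategy="ucb")
uncertainty_batch = acquire(mean, std, k=2, strategy="uncertainty")
```

In this toy case the three strategies select different batches. Greedy concentrates on the ligands already predicted to score best, which is consistent with the biased sampling noted above, whereas uncertainty-driven selection spends the docking budget where the surrogate is least sure.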
