Generalization within in silico screening
Andreas Loukas, Pan Kessel, Vladimir Gligorijevic, Richard Bonneau
TL;DR
This work reframes in silico screening as a policy-driven generalization problem, showing that the selectivity of batch-design policies and the rarity of predicted positives critically shape generalization. It extends learning theory, combining PAC-Bayes, stability, and Lipschitz arguments, to bound the screening risk under a policy $\pi_f(x) \propto \alpha + f(x)$, and it introduces batched prediction, where the model predicts the mean label of a batch and is evaluated on that prediction. The paper proves that batching generally improves generalization, with bounds that tighten as the batch size grows and that are further improved by early stopping, and validates these ideas empirically on antibody design and QM9 molecular property tasks. The results yield actionable guidance: use less aggressive per-sample selectivity (e.g., set $\alpha=1$) and rely on larger batch sizes to reliably forecast batch quality, while remaining mindful of distribution shifts and the asymptotic nature of the bounds.
Abstract
In silico screening uses predictive models to select a batch of compounds with favorable properties from a library for experimental validation. Unlike conventional learning paradigms, success in this context is measured by the performance of the predictive model on the selected subset of compounds rather than the entire set of predictions. By extending learning theory, we show that the selectivity of the selection policy can significantly impact generalization, with a higher risk of errors occurring when exclusively selecting predicted positives and when targeting rare properties. Our analysis suggests a way to mitigate these challenges. We show that generalization can be markedly enhanced when considering a model's ability to predict the fraction of desired outcomes in a batch. This is promising, as the primary aim of screening is not necessarily to pinpoint the label of each compound individually, but rather to assemble a batch enriched for desirable compounds. Our theoretical insights are empirically validated across diverse tasks, architectures, and screening scenarios, underscoring their applicability.
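To make the distinction between per-sample and batched prediction concrete, the following is a minimal sketch on synthetic data, not the paper's experimental setup. It assumes a policy of the form $\pi_f(x) \propto \alpha + f(x)$ with nonnegative scores $f(x) \in [0, 1]$, and it draws batches with replacement. The library, the noise model, and the helper names (`policy_weights`, `sample_batch`, `batch_errors`) are all hypothetical and chosen only for illustration.

```python
import random


def policy_weights(scores, alpha):
    """Unnormalized selection weights alpha + f(x), assuming f(x) >= 0."""
    return [alpha + s for s in scores]


def sample_batch(scores, alpha, batch_size, rng):
    """Draw indices with probability proportional to alpha + f(x).

    Sampling is with replacement, which is an assumption of this sketch.
    """
    weights = policy_weights(scores, alpha)
    return rng.choices(range(len(scores)), weights=weights, k=batch_size)


def batch_errors(scores, labels, batch):
    """Return (mean per-sample absolute error, absolute error of the batch mean)."""
    k = len(batch)
    per_sample = sum(abs(scores[i] - labels[i]) for i in batch) / k
    predicted_mean = sum(scores[i] for i in batch) / k
    observed_mean = sum(labels[i] for i in batch) / k
    return per_sample, abs(predicted_mean - observed_mean)


# Hypothetical synthetic library in which the desired property is rare.
rng = random.Random(0)
n = 5000
true_prob = [rng.betavariate(1, 9) for _ in range(n)]      # mean of about 0.1
labels = [1 if rng.random() < p else 0 for p in true_prob]
# Noisy model scores, clipped to [0, 1] so that the weights stay nonnegative.
scores = [min(1.0, max(0.0, p + rng.gauss(0.0, 0.05))) for p in true_prob]

for alpha in (0.0, 1.0):
    batch = sample_batch(scores, alpha, 256, rng)
    per_sample, batch_mean_err = batch_errors(scores, labels, batch)
    print(f"alpha={alpha}: per-sample error={per_sample:.3f}, "
          f"batch-mean error={batch_mean_err:.3f}")
```

By the triangle inequality, the error of the batch mean never exceeds the mean per-sample error on the same batch, which gives some intuition for why predicting the fraction of desired outcomes is the easier target. The sketch illustrates only the evaluation protocol and does not reproduce the paper's bounds.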
