Achieving Well-Informed Decision-Making in Drug Discovery: A Comprehensive Calibration Study using Neural Network-Based Structure-Activity Models

Hannah Rosa Friesacher; Ola Engkvist; Lewis Mervin; Yves Moreau; Adam Arany

TL;DR

The paper addresses the problem of obtaining well-calibrated uncertainty estimates from neural network models for drug-target interaction prediction. It systematically compares the metrics used for hyperparameter (HP) tuning and proposes Bayesian Linear Probing (BLP), a computationally efficient last-layer Bayesian approach, which it evaluates alongside post hoc Platt scaling and calibration-free uncertainty quantification methods. Across three ChEMBL targets, using BCE loss or ACE as the HP metric consistently improves probability calibration and, in several cases, AUC as well; BLP matches the calibration of common uncertainty quantification methods at a lower computational cost than full Bayesian treatments. The work provides practical guidance for building reliably calibrated models in drug discovery, supporting better-informed decisions and potentially lower experimental costs. It also shows that combining well-performing uncertainty quantification approaches with post hoc calibration can further improve model accuracy and calibration.

Abstract

In the drug discovery process, where experiments can be costly and time-consuming, computational models that predict drug-target interactions are valuable tools to accelerate the development of new therapeutic agents. Estimating the uncertainty inherent in these neural network predictions provides valuable information that facilitates optimal decision-making when risk assessment is crucial. However, such models can be poorly calibrated, which results in unreliable uncertainty estimates that do not reflect the true predictive uncertainty. In this study, we compare different metrics, including accuracy and calibration scores, used for model hyperparameter tuning to investigate which model selection strategy achieves well-calibrated models. Furthermore, we propose to use a computationally efficient Bayesian uncertainty estimation method named Bayesian Linear Probing (BLP), which generates Hamiltonian Monte Carlo (HMC) trajectories to obtain samples for the parameters of a Bayesian Logistic Regression fitted to the hidden layer of the baseline neural network. We report that BLP improves model calibration and achieves the performance of common uncertainty quantification methods by combining the benefits of uncertainty estimation and probability calibration methods. Finally, we show that combining a post hoc calibration method with well-performing uncertainty quantification approaches can boost model accuracy and calibration.
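
The calibration scores mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common binary formulation, in which the mean predicted probability of the active class in each bin is compared with the observed fraction of actives; the exact definitions used in the paper are not given in this excerpt. The function names, the default of 10 bins, and the toy data are illustrative assumptions, not the authors' implementation. The expected calibration error (ECE) uses equal-width bins, while the adaptive calibration error (ACE) uses bins holding roughly the same number of samples.

```python
def _bin_gap(members):
    """Absolute gap between observed positive rate and mean predicted probability."""
    pos_rate = sum(y for _, y in members) / len(members)
    mean_prob = sum(p for p, _ in members) / len(members)
    return abs(pos_rate - mean_prob)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE sketch: equal-width bins, gaps weighted by the share of samples per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    n = len(y_true)
    return sum(len(m) / n * _bin_gap(m) for m in bins if m)


def adaptive_calibration_error(y_true, y_prob, n_bins=10):
    """ACE sketch: bins of (nearly) equal size over the sorted predictions, gaps averaged."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    gaps = []
    for b in range(n_bins):
        members = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if members:  # bins can be empty when n < n_bins
            gaps.append(_bin_gap(members))
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): predictions that are slightly underconfident.
labels = [0, 0, 1, 1]
probs = [0.1, 0.1, 0.9, 0.9]
ece = expected_calibration_error(labels, probs)
ace = adaptive_calibration_error(labels, probs, n_bins=2)
```

Lower values indicate better calibration for both scores. Because ACE places the same number of samples in every bin, it is less affected by sparsely populated probability ranges than ECE.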

Paper Structure (17 sections, 3 equations, 6 figures, 7 tables)

Introduction
Related Work and Background
Post hoc Calibration Methods
Calibration-Free Uncertainty Quantification Methods
Methods
Datasets
Single-Task Model Generation
Experiments
Results and Discussion
Model Selection Study
Model Calibration Study
Conclusion
Acknowledgements
Availability of data and materials
Code availability
...and 2 more sections

Figures (6)

Figure 1: Overview of the dataset generation. The chemical structures were extracted from ChEMBL, and subsequently filtered and clustered. The clusters were assigned to five folds, which were used to set up training, validation, and test sets. The training folds were used for MLP training. The validation set was used for hyperparameter tuning, for fitting the logistic regression models of the deep ensemble model (MLP-E), and for choosing the prior of the Bayesian Linear Probing model (MLP-BLP).
Figure 2: Overview of the architecture of the MLP baseline model and the HP tuning workflow. The size of the hidden layer and the dropout rate, as well as the weight decay and learning rate used during training, were tuned in a grid search using a validation dataset. Four different HP optimization metrics (HP metrics) were used and the performances of the respective models were compared in a model selection study.
Figure 3: Overview of models and probability calibration approaches assessed in the model calibration study. The baseline model (MLP) was compared to the post hoc calibration method Platt scaling (MLP + P) and the Bayesian approaches MC dropout (MLP-D) and deep ensembles (MLP-E). Furthermore, the newly proposed Bayesian approach Bayesian Linear Probing (MLP-BLP) was included in the analysis. The models were trained on the training dataset. For the post hoc calibration approach (Platt scaling), the validation dataset was used to fit the logistic regression model.
Figure 4: Architecture of the combined models MLP-E + P and MLP-BLP + P. To generate Platt-scaled uncertainty quantification methods, a sigmoid was fit to the logits of the deep ensemble (MLP-E) and Bayesian Linear Probing (MLP-BLP) models. For the calibration step, an additional calibration dataset was used.
Figure 5: Results of the model selection study for target CYP3A4. The performance of ten model repetitions is shown. [A] Comparison of the calibration errors of models optimizing accuracy (ACC), AUCROC score (AUC), BCE loss, and the expected (ECE) and adaptive (ACE) calibration errors. [B] ACE vs. AUC of models tuned using different HP metrics. Models in the upper left corner (corresponding to high AUC and low ACE) perform best.
...and 1 more figure
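
The Platt scaling step described in the captions of Figures 3 and 4, where a sigmoid is fit to model logits on held-out data, can be sketched as follows. This is a simplified illustration under stated assumptions: it fits the two parameters `a` and `b` of `sigmoid(a * logit + b)` by plain gradient descent on the binary cross-entropy and omits the label smoothing of Platt's original procedure. The function names, learning rate, step count, and toy logits are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_platt(logits, labels, lr=0.1, n_steps=2000):
    """Fit a and b so that sigmoid(a * logit + b) minimises mean BCE on held-out data."""
    a, b = 1.0, 0.0  # start from the identity mapping of the logits
    n = len(logits)
    for _ in range(n_steps):
        grad_a = grad_b = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y  # derivative of BCE w.r.t. the scaled logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(logits, a, b):
    """Map raw logits to calibrated probabilities with the fitted sigmoid."""
    return [sigmoid(a * z + b) for z in logits]


# Toy held-out set (hypothetical): the raw logits of +/-4 imply about 98%
# confidence, but only 3 of 4 predictions on each side are correct.
val_logits = [-4.0, -4.0, -4.0, -4.0, 4.0, 4.0, 4.0, 4.0]
val_labels = [0, 0, 0, 1, 1, 1, 1, 0]
a, b = fit_platt(val_logits, val_labels)
calibrated = apply_platt([4.0, -4.0], a, b)
```

On this toy set the fitted slope is smaller than 1, so the overconfident logits are pulled toward the observed hit rate of about 75%. Fitting on data that was not used for training, as the captions describe, prevents the calibration map from inheriting the overconfidence the network shows on its own training set.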
