Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores

Álvaro Ciudad, Adrián Morales-Pastor, Laura Malo, Isaac Filella-Mercè, Victor Guallar, Alexis Molina

TL;DR

ScoreFormer is a graph-transformer surrogate for large-scale docking-score prediction. It combines Principal Neighborhood Aggregation (PNA) with Learnable Random Walk Positional Encodings (LRWPE) to capture local topology and long-range interactions in molecular graphs, and it replaces the costly virtual node with an efficient attention mechanism. A lean variant, L-ScoreFormer, uses fewer parameters for improved efficiency. Across seven datasets and multiple docking settings, ScoreFormer and L-ScoreFormer achieve competitive or superior docking-score prediction and hit-recovery performance compared with FiLMv2, and deliver inference-speed gains of up to ~1.86x on large-scale screens. Generalization tests show robustness to out-of-distribution chemistries and varying molecular weights, which supports the practicality of these models for rapid HTVS campaigns in drug discovery. The work points to active learning, explainability, and uncertainty estimation as directions for further improving reliability and interpretability in computational chemistry.

Abstract

In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model's ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.
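
To make the positional-encoding idea concrete, the following is a minimal sketch of how random walk positional encodings are commonly initialized for a molecular graph: each node receives the probabilities of returning to itself after 1, 2, ..., k random-walk steps. This is the standard, non-learnable starting point. How ScoreFormer's learnable variant (LRWPE) updates these encodings is not described here, and the function name `random_walk_pe`, the parameter `k`, and the toy graph are assumptions made for illustration only.

```python
import numpy as np


def random_walk_pe(adj, k):
    """Return an (n_nodes, k) array of random-walk return probabilities.

    Column j holds, for every node, the probability that a random walk
    starting at that node is back at it after j + 1 steps.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    # Inverse degree, leaving isolated nodes at 0 to avoid division by zero.
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    # Row-normalized transition matrix D^-1 A.
    rw = adj * inv_deg[:, None]

    power = np.eye(adj.shape[0])
    features = []
    for _ in range(k):
        power = power @ rw                # transition matrix after one more step
        features.append(np.diag(power))   # return probability for each node
    return np.stack(features, axis=1)


# Toy example: a three-atom chain 0 - 1 - 2 (an assumed input, not paper data).
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
pe = random_walk_pe(chain, k=3)
print(pe)
```

In this toy chain the middle atom always returns to itself after two steps, while each end atom does so with probability 0.5, so the encoding separates nodes that differ only in their structural position. In a model, such a vector would typically be concatenated with or added to the atom features before the first layer.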

Paper Structure (27 sections, 5 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Schematic representation of the architecture used in the ScoreFormer model.
  • Figure 2: Performance metrics by model type. Bar heights indicate the mean performance across targets, docking engines, and docking settings. The metrics reported are those used to evaluate docking score prediction and hit recovery.
  • Figure 3: wMSE grouped by docking setting, target, and model type. Data are shown only for the Glide docking engine, as it is the only one for which the two different types of docking settings are available.
  • Figure 4: Schematic representation of the pipeline used to generate a functional model able to predict docking scores of new molecules.