Generalization Beyond Benchmarks: Evaluating Learnable Protein-Ligand Scoring Functions on Unseen Targets

Jakub Kopko, David Graber, Saltuk Mustafa Eyrilmez, Stanislav Mazurenko, David Bednar, Jiri Sedlar, Josef Sivic

TL;DR

This work probes how well learnable protein–ligand scoring functions generalize to unseen targets and finds that standard benchmarks mask substantial generalization gaps. Using strict pocket-level out-of-distribution (OOD) splits to evaluate two leading scorers (GEMS and GenScore), the study shows limited transfer to novel targets and highlights biases in existing benchmarks. It further tests whether large-scale self-supervised representations (ATOMICA embeddings) can bridge the gap, and explores how sparse target-specific data can aid validation or fine-tuning, with mixed but generally positive effects. The findings argue for more rigorous, target-aware evaluation protocols and suggest that richer representations, combined with targeted data, can improve robustness to novel proteins in real-world drug discovery.

Abstract

As machine learning becomes increasingly central to molecular design, it is vital to ensure the reliability of learnable protein-ligand scoring functions on novel protein targets. While many scoring functions perform well on standard benchmarks, their ability to generalize beyond training data remains a significant challenge. In this work, we evaluate the generalization capability of state-of-the-art scoring functions on dataset splits that simulate evaluation on targets with a limited number of known structures and experimental affinity measurements. Our analysis reveals that the commonly used benchmarks do not reflect the true challenge of generalizing to novel targets. We also investigate whether large-scale self-supervised pretraining can bridge this generalization gap and we provide preliminary evidence of its potential. Furthermore, we probe the efficacy of simple methods that leverage limited test-target data to improve scoring function performance. Our findings underscore the need for more rigorous evaluation protocols and offer practical guidance for designing scoring functions with predictive power extending to novel protein targets.

Paper Structure

This paper contains 22 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The full proteins (top) and their pockets (bottom) ... [rest of caption] ...
  • Figure 2: t-SNE projections of ATOMICA embeddings of ligand–pocket interactions from PDBbind, colored by experimental affinity (left) and molecular weight (right). The large, crescent-shaped cluster shows clear gradients of affinity and molecular weight, while the smaller, well-separated cluster contains most of the largest ligands, indicating that ATOMICA space captures both properties.
  • Figure 3: Evolution of the scoring power of the three scoring methods GenScore, GEMS, and GEMS$_{\text{ATOMICA}}$ during model training with the original stratified k-fold splitting of the train–validation data, evaluated on CASF-2016 and the out-of-distribution (OOD) clusters 2P15 (left) and 2VW5 (right). The x-axis shows training progress as a percentage of total training epochs, while the y-axis displays the Pearson correlation coefficient $\uparrow$ between predicted and true affinity. Each line represents the mean performance across five cross-validation folds, with shaded uncertainty regions ($\pm1$ standard deviation) indicating the variability in performance across the five training runs. Note that the uncertainty regions disappear towards higher epoch numbers due to early stopping, which results in different models completing training at different epochs. Consequently, the later portions of the curves are based on fewer than five models. The final results imply that the original stratified k-fold splitting leads to overfitted models, and that using limited test-target validation data allows for better early stopping and the selection of better models.
  • Figure S1: Upper left: t-SNE projections of ATOMICA embeddings of ligand–pocket interactions from PDBbind. The three colors show projected embeddings from three different test target clusters. One sample from each cluster is indicated by a star. The complexes corresponding to the three stars are shown in the remaining plots of this figure. These projections indicate that the ATOMICA space is organized by pocket shape, with some pockets forming tight regions. Considered together with Figure 2, they suggest that these regions align with specific ranges of ligand sizes and affinities; in our experiments, scoring interactions representing such regions was particularly difficult when no representatives were available during training. Upper right: 3HKU, example from the 3DD0 cluster. Lower right: 3QAA, example from the 3O9I cluster. Lower left: 5ITF, example from the 3F3E cluster.
  • Figure S2: Evolution of the scoring power of the three scoring methods GenScore, GEMS, and GEMS$_{\text{ATOMICA}}$ during model training, evaluated on CASF-2016 and the out-of-distribution (OOD) clusters 3DD0, 3F3E, 1NVQ, 3O9I and 1SQA. The x-axis shows training progress as a percentage of total training epochs, while the y-axis displays the Pearson correlation coefficient $\uparrow$ between predicted and true affinity. Each line represents the mean performance across five cross-validation folds, with shaded uncertainty regions ($\pm1$ standard deviation) indicating the variability in performance across the five training runs. Note that the uncertainty regions disappear towards higher epoch numbers due to early stopping, which results in different models completing training at different epochs. Consequently, the later portions of the curves are based on fewer than five models.
  • ...and 1 more figure
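
The captions for Figures 3 and S2 describe curves of the mean Pearson correlation across five cross-validation folds with a $\pm1$ standard deviation band, where early stopping leaves fewer than five models at later epochs. A minimal sketch of that kind of aggregation is shown below. It is not the authors' code: the function names `pearson` and `aggregate_fold_curves`, the alignment of folds by checkpoint index, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def pearson(pred, true):
    """Pearson correlation coefficient between predicted and true affinities."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    pred_c = pred - pred.mean()
    true_c = true - true.mean()
    denom = np.sqrt((pred_c ** 2).sum() * (true_c ** 2).sum())
    return float((pred_c * true_c).sum() / denom)


def aggregate_fold_curves(curves):
    """Aggregate per-fold metric curves of unequal length.

    `curves` is a list with one sequence per fold (hypothetical layout);
    each sequence holds one metric value per evaluated checkpoint. Folds
    that stopped early are padded with NaN so that they do not contribute
    to later checkpoints.

    Returns (mean, std, count) arrays, one entry per checkpoint, where
    `count` is the number of folds still training at that checkpoint.
    """
    n_steps = max(len(c) for c in curves)
    padded = np.full((len(curves), n_steps), np.nan)
    for i, curve in enumerate(curves):
        padded[i, : len(curve)] = curve
    count = (~np.isnan(padded)).sum(axis=0)
    mean = np.nanmean(padded, axis=0)
    std = np.nanstd(padded, axis=0)  # population std; 0.0 when one fold remains
    return mean, std, count


# Toy example with made-up values: three folds that stop at different epochs.
toy_curves = [[0.1, 0.2, 0.3], [0.3, 0.4], [0.2]]
mean, std, count = aggregate_fold_curves(toy_curves)
for step, (m, s, n) in enumerate(zip(mean, std, count)):
    print(f"checkpoint {step}: mean={m:.3f} std={s:.3f} folds={n}")
```

Returning the per-checkpoint fold count alongside the mean and standard deviation makes it explicit which parts of a curve rest on fewer than five models, which is the caveat the captions raise about the later epochs.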