PSBench: a large-scale benchmark for estimating the accuracy of protein complex structural models

Pawan Neupane, Jian Liu, Jianlin Cheng

TL;DR

PSBench is a large, publicly available benchmark for estimating the accuracy of protein complex structural models (EMA). It addresses the scarcity of EMA training data with four datasets derived from CASP15 and CASP16, totaling over one million models, each annotated with 10 quality scores at the global, interface, and local levels. It also provides automated labeling tools, baseline EMA methods, and standardized metrics for training and benchmarking. As a demonstration, GATE, an EMA method trained on the CASP15 data, ranked among the top-performing EMA methods in the blind CASP16 evaluation. The authors position PSBench as an ImageNet-like resource for EMA research on protein complexes and plan to expand the target set and invite community contributions.

Abstract

Predicting protein complex structures is essential for protein function analysis, protein design, and drug discovery. While AI methods like AlphaFold can predict accurate structural models for many protein complexes, reliably estimating the quality of these predicted models (estimation of model accuracy, or EMA) for model ranking and selection remains a major challenge. A key barrier to developing effective machine learning-based EMA methods is the lack of large, diverse, and well-annotated datasets for training and evaluation. To address this gap, we introduce PSBench, a benchmark suite comprising four large-scale, labeled datasets generated during the 15th and 16th community-wide Critical Assessment of Protein Structure Prediction (CASP15 and CASP16). PSBench includes over one million structural models covering a wide range of protein sequence lengths, complex stoichiometries, functional classes, and modeling difficulties. Each model is annotated with multiple complementary quality scores at the global, local, and interface levels. PSBench also provides multiple evaluation metrics and baseline EMA methods to facilitate rigorous comparisons. To demonstrate PSBench's utility, we trained and evaluated GATE, a graph transformer-based EMA method, on the CASP15 data. GATE was blindly tested in CASP16 (2024), where it ranked among the top-performing EMA methods. These results highlight PSBench as a valuable resource for advancing EMA research in protein complex modeling. PSBench is publicly available at: https://github.com/BioinfoMachineLearning/PSBench.

Paper Structure

This paper contains 33 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Overview of PSBench. (a) Pipeline. The PSBench pipeline for preparing four CASP datasets for estimating the accuracy of protein complex models (EMA). Predicted structural models are compared with native (true) structures to compute global, local, and interface quality scores, which serve as labels. (b) Methods. Six representative baseline EMA methods for performance comparison. (c) Metrics. Four metrics for evaluating predicted model quality scores against the true ones (labels): Pearson's correlation, Spearman's correlation, ranking loss, and AUROC (area under the receiver operating characteristic curve). The evaluation tools are included in PSBench.
  • Figure 2: CASP15_inhouse_dataset. (a) Model count. Number of models per target in the dataset. (b) Score distribution. Box plots of six representative quality scores of the models for each target. (c) Example. Three representative models (worst, average, best) for target H1143, ranked by the sum of the six representative quality scores. Each model, with its two chains colored blue and red, is superimposed on the true structure in gray.
  • Figure 3: CASP16 EMA results. Performance of the top 20 of 38 CASP16 EMA predictors in predicting the TM-scores of structural models of 37 complex targets. (a) Pearson's correlation. (b) Spearman's correlation. (c) Ranking loss. (d) AUROC. MULTICOM_GATE, highlighted in red, ranked first, third, third, and third on the four metrics, respectively.
  • Figure S1: Diversity of 79 protein complex targets. (a) Number of targets for each of the 25 stoichiometries represented in PSBench. A stoichiometry is written as letters interleaved with numbers: each letter denotes a unique chain, and the number after it is the copy count of that chain. For instance, A1B2 denotes a complex with two unique chains, A and B, with one copy of A and two copies of B. (b) Number of targets for each of 21 broad protein function classes plus an "Unknown" class in PSBench. "Unknown" means that no class information is available, so this group may span many different classes.
  • Figure S2: Distribution of AFM confidence scores per target in CASP15_inhouse_dataset.
  • ...and 6 more figures
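
The four metrics named in Figure 1(c) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the evaluation code shipped with PSBench. The function names are hypothetical, and two definitions are assumptions that the captions above do not spell out: ranking loss is taken to be the true score of the best model minus the true score of the model the predictor ranks first, and AUROC is computed after labeling a model "good" when its true score reaches a caller-supplied threshold.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ranking_loss(predicted, true):
    """Assumed definition: best true score minus the true score of the
    model with the highest predicted score (0 means a perfect pick)."""
    top = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(true) - true[top]


def auroc(predicted, true, threshold):
    """Probability that a good model (true >= threshold, an assumed
    labeling rule) receives a higher predicted score than a poor one;
    tied predictions count as half."""
    pos = [p for p, t in zip(predicted, true) if t >= threshold]
    neg = [p for p, t in zip(predicted, true) if t < threshold]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example for one target: made-up true and predicted quality scores.
true_scores = [0.30, 0.55, 0.80, 0.65]
predicted_scores = [0.40, 0.50, 0.70, 0.75]
print(pearson(predicted_scores, true_scores))
print(spearman(predicted_scores, true_scores))
print(ranking_loss(predicted_scores, true_scores))
print(auroc(predicted_scores, true_scores, threshold=0.6))
```

Metrics of this kind are typically computed per target and then averaged across targets, since score ranges differ from one complex to another. Note that `auroc` is undefined for a target whose models all fall on one side of the threshold, so such targets would need to be skipped or handled separately.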