Revealing data leakage in protein interaction benchmarks

Anton Bushuiev; Roman Bushuiev; Jiri Sedlar; Tomas Pluskal; Jiri Damborsky; Stanislav Mazurenko; Josef Sivic

TL;DR

Protein-interaction benchmarks often rely on train-test splits based on metadata or sequence similarity, which introduce substantial data leakage and lead to overestimated generalization. The authors quantify leakage using large-scale interface-structure comparison (iDist) across PDB-derived protein–protein interactions (PPIs) and a SKEMPI-based benchmark, showing that leakage is pervasive in standard splits. They review existing splitting practices in protein interface prediction, docking, and binder design, and present interface-structure-based splitting as a robust alternative, pointing to tools such as Foldseek and TM-align for building non-leaking partitions at scale. They also emphasize the value of domain expertise in constructing high-quality splits and give concrete recommendations for reporting leakage and adopting interface-based evaluation.

Abstract

In recent years, there has been remarkable progress in machine learning for protein-protein interactions. However, prior work has predominantly focused on improving learning algorithms, with less attention paid to evaluation strategies and data preparation. Here, we demonstrate that further development of machine learning methods may be hindered by the quality of existing train-test splits. Specifically, we find that commonly used splitting strategies for protein complexes, based on protein sequence or metadata similarity, introduce major data leakage. This may result in overoptimistic evaluation of generalization, as well as unfair benchmarking of the models, biased towards assessing their overfitting capacity rather than practical utility. To overcome the data leakage, we recommend constructing data splits based on 3D structural similarity of protein-protein interfaces and suggest corresponding algorithms. We believe that addressing the data leakage problem is critical for further progress in this research area.
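
The recommendation to split by interface similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 1-D coordinates, and the absolute-difference distance are hypothetical stand-ins for a real interface comparison such as iDist, and the threshold of 0.04 is borrowed from the near-duplicate threshold quoted in the Figure 3 caption. The idea is to cluster interfaces that are near duplicates of one another and then assign whole clusters to either the training or the test fold.

```python
import random
from itertools import combinations


def cluster_near_duplicates(ids, distance, threshold):
    """Group ids into connected components of the near-duplicate graph.

    Two ids are linked when distance(a, b) < threshold. This loops over
    all pairs, so it is only suitable for small illustrative datasets.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if distance(a, b) < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_by_clusters(clusters, test_fraction=0.1, seed=0):
    """Assign whole clusters to train or test; a cluster is never divided."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    rng.shuffle(shuffled)
    budget = test_fraction * sum(len(c) for c in shuffled)
    train, test = [], []
    for cluster in shuffled:
        # Fill the test fold greedily while the cluster still fits the budget.
        if len(test) + len(cluster) <= budget:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Hypothetical toy data: each interface is reduced to one coordinate.
toy_coords = {"A1": 0.00, "A2": 0.01, "A3": 0.03,
              "B1": 0.50, "B2": 0.52, "C1": 0.90}


def toy_distance(a, b):
    # Placeholder for a structural interface distance (assumption).
    return abs(toy_coords[a] - toy_coords[b])


toy_clusters = cluster_near_duplicates(list(toy_coords), toy_distance, 0.04)
toy_train, toy_test = split_by_clusters(toy_clusters, test_fraction=0.34, seed=0)
```

Because clusters are never divided, no test interface has a training interface within the threshold, at the cost of a test fold whose size only approximates the requested fraction.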

Paper Structure (19 sections, 4 figures)

Introduction
Related work
Protein interface prediction.
Protein docking.
Protein binder design.
Other tasks.
Problems of existing data splits for protein complexes
Splitting by metadata is not enough
Splitting by sequence similarity is not enough
Best practices for data splitting of protein complexes
Splitting by interface similarity is recommended
Human expertise is highly beneficial
Recommendations
Methods
Comparing protein–protein interactions
...and 4 more sections

Figures (4)

Figure 1: Data leakage in protein–protein interaction splits. Bars show the average percentage of test examples having a nearly identical training example for $90\%/10\%$ splits of 50,000 protein–protein interactions from the Protein Data Bank, with standard deviations (error bars) across 5 random samples. Near duplicates are identified using the iDist algorithm.
Figure 2: Splitting by PDB codes causes data leakage in benchmarks for PPI design. The figure shows three protein complexes taken from SKEMPI v2.0, a standard dataset of annotated PPI mutations. Different chains in the entries are color-coded and labeled with their respective codes. In total, the dataset contains 10 such near-duplicate interactions (PDB codes 3BTD, 3BTE, 3BTT, 3BTM, 3BTQ, 3BTW, 3BTH, 3BTF, 3BTG, 2FTL), representing single-point mutants of the same interaction between a serine protease and its inhibitor (Krowarsch et al., 1999). Recent machine learning research in protein–protein interactions employed PDB-code splitting, resulting in near-duplicate entries, similar to those shown in this figure, scattered across train-validation-test folds.
Figure 3: Splitting by sequence similarity introduces data leakage in benchmarks for protein docking and interface prediction. The figure shows two phosphorylase homooligomers, taken from DIPS, a standard dataset for training and validating machine learning models. The complex to the left (PDB code 1K3F), as well as the complex to the right (1K9S), is composed of five identical proteins (highlighted with colors). Nevertheless, the proteins across the entries have very low sequence similarity (26.5%). Despite the sequences in the complexes being different, the secondary structure of the chains, the topology of the interactions, as well as the 3D structure and the amino acids at the interfaces are highly similar across the entries (iDist $< 0.04$, the near-duplicate threshold; iAlign's p-value $< 10^{-6}$). Recent machine learning research for protein docking and interface prediction employed data splitting based on sequence similarity, resulting in data leakage.
Figure 4: Structural redundancy challenges data splitting of protein interactions. The figure shows five different PDB entries representing the 3D structure of the canavalin trimer (Ko et al., 1993), with protein chains in different colors. The high level of structural redundancy observed in protein–protein interactions, both within individual entries and across multiple entries, necessitates careful data splitting strategies (for example, the interaction highlighted within the circle is represented in the figure fifteen times, occurring three times in each structure). However, inconsistent metadata and the modular nature of protein structures make naive approaches, such as splitting based on PDB codes or sequence similarity, prone to fail, distributing the same interactions across training-validation-test folds.
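
The leakage measure plotted in Figure 1, the percentage of test examples that have a nearly identical training example, can be sketched as follows. This is an illustrative approximation rather than the authors' code: `leakage_ratio`, the toy coordinates, and the absolute-difference distance are hypothetical, and a real analysis would plug in an interface comparison such as iDist with its near-duplicate threshold.

```python
def leakage_ratio(train, test, distance, threshold):
    """Fraction of test items with at least one training item closer than threshold."""
    if not test:
        return 0.0
    leaked = sum(
        1 for t in test
        if any(distance(t, s) < threshold for s in train)
    )
    return leaked / len(test)


# Hypothetical toy data: each interface is reduced to one coordinate.
demo_coords = {"A1": 0.00, "A2": 0.01, "B1": 0.50,
               "B2": 0.52, "C1": 0.90, "D1": 0.70}


def demo_distance(a, b):
    # Placeholder for a structural interface distance (assumption).
    return abs(demo_coords[a] - demo_coords[b])


# A split that ignores interface similarity: A2 and B2 have near
# duplicates (A1 and B1) in the training fold, D1 does not.
demo_train = ["A1", "B1", "C1"]
demo_test = ["A2", "B2", "D1"]
demo_ratio = leakage_ratio(demo_train, demo_test, demo_distance, threshold=0.04)
print(f"leakage: {100 * demo_ratio:.1f}% of test examples")
```

Reporting this ratio alongside benchmark results makes it visible how much of a test score could come from memorised near duplicates.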
