UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang

TL;DR

This work tackles biases and fragmentation in ligand binding site detection by introducing UniSite-DS, the first UniProt-centric binding site dataset that aggregates sites across multiple structures per protein. It then proposes UniSite, an end-to-end set-prediction framework with bijective Hungarian matching, offering two variants: UniSite-1D (sequence-only) and UniSite-3D (sequence+structure). A new IoU-based Average Precision evaluation is introduced to better reflect binding-site quality and avoid issues from traditional center-based metrics. Extensive experiments on UniSite-DS and benchmark datasets demonstrate state-of-the-art performance and highlight the dataset’s role in reducing bias, with practical implications for docking and drug design. The work also provides a detailed curation workflow, ablations, and publicly available code and data to enable broader use and further development.

Abstract

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.
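
To make the idea of bijective matching concrete, the sketch below pairs predicted sites with ground-truth sites one-to-one so that the total IoU is maximal. It is an illustration only, not the authors' implementation: sites are assumed to be sets of residue indices, the function names `iou` and `bijective_match` are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns an optimal one-to-one assignment too, but only scales to a handful of sites).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two sites given as residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bijective_match(preds, truths):
    """Return (pred_idx, truth_idx) pairs maximizing total IoU, one-to-one.

    Brute force over permutations; a stand-in for Hungarian matching
    that is only practical for small inputs (assumption of this sketch).
    """
    n_p, n_t = len(preds), len(truths)
    k = min(n_p, n_t)
    best_score, best_pairs = -1.0, []
    if n_p >= n_t:
        # Pick an ordered selection of k predictions for truths 0..k-1.
        candidates = ([(p, t) for t, p in enumerate(perm)]
                      for perm in permutations(range(n_p), k))
    else:
        # Fewer predictions than truths: pick k truths for preds 0..k-1.
        candidates = ([(p, t) for p, t in enumerate(perm)]
                      for perm in permutations(range(n_t), k))
    for pairs in candidates:
        score = sum(iou(preds[p], truths[t]) for p, t in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return sorted(best_pairs)


# Toy example: three predicted sites, two ground-truth sites.
truths = [{1, 2, 3, 4}, {10, 11, 12}]
preds = [{10, 11}, {1, 2, 3}, {1, 2, 3, 4, 5}]
matches = bijective_match(preds, truths)
print(matches)
```

In this toy case predictions 1 and 2 both overlap the first ground-truth site, but only one of them can be matched to it; the other is left unmatched, which is what lets a set-prediction loss penalize duplicate predictions.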

Paper Structure

This paper contains 34 sections, 8 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison between UniSite-DS and previous datasets. (Top left) In PDBbind2020, only one ligand binding site and one structure are recorded for UniProt ID Q8WS26. (Top right) In contrast, UniSite-DS integrates distinct binding sites across all available structures (highly similar, mean TM-Score=0.99), identifying 17 unique ligand binding sites derived from 13 representative PDB entries. (Bottom left and center) Comparison of UniSite-DS with other widely used datasets in terms of multi-site entries and the number of unique proteins. For HOLO4K and COACH420, the most widely used mlig subsets were selected, where each entry corresponds to a PDB structure, while in UniSite-DS, each entry corresponds to a UniProt ID. (Bottom right) Distribution of the number of unique proteins in UniSite-DS with respect to the number of distinct binding sites they contain.
  • Figure 2: Comparison of detection approaches. (Top) Conventional learning-based binding site detection methods typically employ a discontinuous workflow: first predicting binary masks for residues/atoms, then clustering these masks into distinct binding sites. (Bottom) In contrast, our method directly outputs a set of $N$ potentially overlapping binding sites in a single step.
  • Figure 3: The architecture of UniSite. Our models employ an encoder to extract the residue-level features. Then a decoder module is used to generate embeddings of the $N$ predicted binding sites. Finally, the segmentation module outputs $N$ potentially overlapping binding site predictions. The encoder comprises dual pathways: a sequence encoder and an optional structural encoder, allowing UniSite to operate with either sequence-only input or combined sequence-structure information.
  • Figure 4: DCC or DCA failure cases. (A) The same predicted site is counted repeatedly because no matching is performed. (B) Different ligands bound to the same site lead to deviations in DCC or DCA calculations. (C-D) Failed predictions classified as successful by DCC or DCA but below the IoU threshold.
  • Figure S1: The significant impact of binding site detection on molecular docking. (A) Gold [verdonk2003improved] defines the binding site using a sphere. (B) AutoDock Vina [trott2010autodock] defines the binding site using a cube. (C) DeepDock [liao2019deepdock] and (D) Uni-Mol [zhou2023uni] identify the binding site by applying a fixed radius around the ligand. (E) Docking success rates on the PoseBusters dataset under different binding site configurations. Docking success rate is defined as the proportion of predictions with an RMSD less than 2 Å. Data sourced from [umol2024, buttenschoen2024posebusters, abramson2024accurate, iambic_np2].
  • ...and 3 more figures
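
The repeated-counting failure in Figure 4(A) can be illustrated with a small sketch of an IoU-based Average Precision. This is not the paper's evaluation code; it assumes the conventional object-detection protocol: predictions are ranked by confidence, each ground-truth site can be matched at most once, a match requires IoU at or above a threshold (0.5 here, chosen only for illustration), and AP is the non-interpolated mean of precision at each true positive. The function names and the toy data are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two sites given as residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(preds, truths, thr=0.5):
    """Non-interpolated AP for one protein.

    preds:  list of (confidence, residue_set) tuples
    truths: list of residue sets
    thr:    IoU threshold for a true positive (assumed value)
    """
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = set()      # ground-truth indices already claimed
    true_pos = 0
    precisions = []
    for rank, (_, site) in enumerate(ranked, start=1):
        best_j, best_iou = None, thr
        for j, truth in enumerate(truths):
            if j in matched:
                continue  # each ground-truth site counts only once
            value = iou(site, truth)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
            precisions.append(true_pos / rank)
    return sum(precisions) / len(truths) if truths else 0.0


# Two true sites; the second prediction duplicates the first one.
truths = [{1, 2, 3, 4}, {10, 11, 12, 13}]
preds = [
    (0.9, {1, 2, 3, 4}),   # exact hit on site 0
    (0.8, {1, 2, 3}),      # duplicate of site 0: counted as a false positive
    (0.7, {10, 11, 12}),   # IoU 0.75 with site 1: true positive
]
ap = average_precision(preds, truths)
print(round(ap, 4))
```

Because the duplicate at rank 2 cannot claim the already-matched site, it lowers the precision of the later true positive, whereas a metric without matching would credit the same site twice.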