UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang
TL;DR
This work addresses statistical bias and pipeline fragmentation in ligand binding site detection. It introduces UniSite-DS, the first UniProt-centric binding site dataset, which aggregates sites across multiple structures of the same protein. It then proposes UniSite, an end-to-end set-prediction framework trained with bijective (Hungarian) matching, in two variants: UniSite-1D (sequence only) and UniSite-3D (sequence plus structure). An IoU-based Average Precision metric is introduced to better reflect binding-site quality and to avoid the shortcomings of traditional center-based metrics. Extensive experiments on UniSite-DS and benchmark datasets demonstrate state-of-the-art performance and highlight the dataset’s role in reducing bias, with practical implications for docking and drug design. The work also provides a detailed curation workflow, ablations, and publicly available code and data to support broader use and further development.
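To make the idea of bijective matching concrete, here is a minimal sketch, not the authors' implementation. It assumes each binding site is represented as a set of residue indices and that the matching cost is 1 - IoU; both are illustrative assumptions. The helper names `iou` and `best_bijective_match` are hypothetical. A brute-force search over permutations stands in for the Hungarian algorithm, which returns the same minimum-cost assignment far more efficiently.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_bijective_match(predicted, truth):
    """Return (pairs, total_cost) for the minimum-cost one-to-one matching.

    Assumes len(predicted) >= len(truth). The cost 1 - IoU is an
    illustrative choice, not the loss used in the paper. Brute force is
    only practical for a handful of sites.
    """
    if len(predicted) < len(truth):
        raise ValueError("need at least as many predictions as true sites")
    best_pairs, best_cost = [], float("inf")
    # Each permutation assigns a distinct prediction to every true site.
    for perm in permutations(range(len(predicted)), len(truth)):
        cost = sum(1.0 - iou(predicted[p], truth[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_pairs = [(p, t) for t, p in enumerate(perm)]
            best_cost = cost
    return best_pairs, best_cost


# Toy example: three predicted sites, two ground-truth sites.
predicted = [{1, 2, 3, 4}, {10, 11, 12}, {20, 21}]
truth = [{10, 11, 12, 13}, {1, 2, 3}]
pairs, cost = best_bijective_match(predicted, truth)
print(pairs, cost)  # (prediction index, truth index) pairs and total cost
```

Because every ground-truth site is paired with exactly one prediction, unmatched predictions (here the third one) can be treated as "no site" during training, which is what lets a set-prediction model skip a separate clustering step.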
Abstract
The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics face several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and overlook the fact that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, in which binary segmentation is followed by a separate clustering step; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data than the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by a set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.
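As an illustration of how an IoU-based Average Precision could be computed for a single protein, the sketch below ranks predicted sites by confidence, counts a prediction as a true positive when its IoU with a not-yet-matched ground-truth site reaches a threshold, and averages precision over the true-positive ranks. The residue-set representation, the 0.5 threshold, the greedy matching rule, and the function names are assumptions made for this example; the paper's exact evaluation protocol may differ.

```python
def iou(a, b):
    """Intersection over Union of two residue-index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(predictions, truth, iou_threshold=0.5):
    """Non-interpolated AP for one protein.

    predictions: list of (confidence, residue_set) tuples.
    truth: list of ground-truth residue sets.
    """
    if not truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    matched = set()  # indices of ground-truth sites already claimed
    hits = []        # True/False per ranked prediction
    for _, site in ranked:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(truth):
            if j in matched:
                continue
            score = iou(site, gt)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            hits.append(True)
        else:
            hits.append(False)
    # Sum precision at each true-positive rank, normalized by the number
    # of ground-truth sites so that missed sites lower the score.
    total, true_positives = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            total += true_positives / rank
    return total / len(truth)


# Toy example: two true sites, three ranked predictions.
truth = [{1, 2, 3, 4}, {10, 11, 12, 13}]
predictions = [
    (0.9, {1, 2, 3, 4, 5}),  # IoU 0.8 with the first true site
    (0.8, {30, 31}),         # overlaps nothing
    (0.6, {10, 11, 12}),     # IoU 0.75 with the second true site
]
ap = average_precision(predictions, truth)
print(round(ap, 4))
```

Unlike a center-distance criterion, this score depends on how much of each site's residue set is actually recovered, so a prediction that lands near a pocket but covers few of its residues is not rewarded.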
