SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction

Yanwen Huang, Bowen Gao, Yinjun Jia, Hongbo Ma, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan

TL;DR

This study introduces a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels, designed to facilitate unbiased bioactivity prediction.

Abstract

Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.

Paper Structure (35 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Pipeline for SIU construction. (A) Small molecules and protein targets were obtained from corresponding databases, cleaned, and deduplicated. Different small molecules binding to the same protein and different pockets (different PDB IDs) of the same protein were filtered and analyzed. (B) These were then subjected to a multi-software docking pipeline, where the small molecules were prepared and docked to their targets, as confirmed by wet-lab experiments, using three different software programs. The resulting poses were filtered through a voting mechanism to construct the final dataset. (C) The dataset is well organized and contains multiple pockets for each protein and multiple molecules for each pocket, allowing downstream tasks to be performed per PDB ID and per assay type.
  • Figure 2: Capability of RMSD to quantify differences in docking poses. (A) RMSD 1.544, well-superimposed poses. (B) RMSD 1.985, similar binding modes. (C) RMSD 8.095, fundamentally different binding modes.
  • Figure 3: Filter selection and dataset statistics. (A) Distribution of the number of PDB files per protein target before and after filtering. (B) Influence of RMSD on success and retention ratios. (C) Heatmap of pairwise t-test p-values comparing the negative logarithmic assay values of four representative assay types, along with the distribution of the values for each type. (D) Heatmap of pairwise t-test p-values comparing assay values for ten representative protein targets, along with their distributions.
  • Figure 4: (a) Pearson and Spearman correlations for various label types, calculated both before and after grouping by PDB IDs. (b) Pearson correlations after grouping by PDB IDs for different assay types trained on different datasets.
  • Figure 5: Visualization of chemical structure differences among small molecules from the top four assay types using t-SNE with ECFP fingerprints.
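
The captions for Figures 1 and 2 describe comparing docking poses by RMSD and filtering poses from three programs through a voting mechanism. The exact voting rule is not given here, so the sketch below is only an illustration under stated assumptions: poses are lists of (x, y, z) coordinates in the same protein frame with identical atom ordering, a pose is kept if it lies within an RMSD threshold of a pose from at least one other program, and the default threshold of 2.0 is an illustrative value, not one taken from the paper. The names `rmsd`, `consensus_poses`, and `program_1` to `program_3` are hypothetical.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses.

    Assumes both poses list the same atoms in the same order and share
    one coordinate frame, so no alignment or symmetry correction is done.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def consensus_poses(poses, threshold=2.0):
    """Return the names of poses that agree with at least one other pose.

    Hypothetical voting rule: two poses "agree" when their RMSD is at or
    below the threshold. The paper's actual rule may differ.
    """
    agreeing = set()
    for (name_a, pose_a), (name_b, pose_b) in combinations(poses.items(), 2):
        if rmsd(pose_a, pose_b) <= threshold:
            agreeing.update((name_a, name_b))
    return sorted(agreeing)


# Toy two-atom ligand docked by three hypothetical programs.
poses = {
    "program_1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "program_2": [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "program_3": [(8.0, 0.0, 0.0), (9.5, 0.0, 0.0)],
}
print(consensus_poses(poses))
```

In this toy input the first two poses differ by a uniform 0.5 shift and the third sits far away, so only the first two survive the filter. For real ligands, a symmetry-aware RMSD from a cheminformatics toolkit would be a more robust choice than this plain coordinate comparison.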