SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction

Yanwen Huang; Bowen Gao; Yinjun Jia; Hongbo Ma; Wei-Ying Ma; Ya-Qin Zhang; Yanyan Lan

TL;DR

This study introduces a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels, designed to facilitate unbiased bioactivity prediction.

Abstract

Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.

Paper Structure (35 sections, 5 figures, 4 tables)

Introduction
Related work
Non-structural datasets on drug-target interaction for bioactivity prediction.
Structural datasets based on experimental structures for bioactivity prediction.
Structural datasets based on modeling structures for bioactivity prediction.
SIU dataset construction and overview
Methods
Data cleaning and deduplication
Bioactivity data extracting.
PDB structure retrieval and mapping.
Structural data construction via multi-software docking
Molecular docking.
Consensus filtering of docking poses.
Data construction for downstream tasks
Dataset organization for unbiased bioactivity prediction.
...and 20 more sections

Figures (5)

Figure 1: Pipeline for SIU construction. (A) Small molecules and protein targets were obtained from their corresponding databases, cleaned, and deduplicated. Different small molecules binding to the same protein, and different pockets (different PDB IDs) of the same protein, were filtered and analyzed. (B) The small molecules were then prepared and docked to their experimentally confirmed targets using three different docking programs. The resulting poses were filtered through a voting mechanism to construct the final dataset. (C) The dataset contains multiple pockets for each protein and multiple molecules for each pocket, so downstream tasks can be organized per PDB ID and per assay type.
Figure 2: Capability of RMSD to quantify differences in docking poses. (A) RMSD 1.544, well-superimposed poses. (B) RMSD 1.985, similar binding modes. (C) RMSD 8.095, fundamentally different binding modes.
Figure 3: Filter selection and dataset statistics. (A) Distribution of the number of PDB files per protein target before and after filtering. (B) Influence of RMSD on success and retention ratios. (C) Pairwise t-test p-value differences between the negative logarithmic assay values of four representative assay types, visualized in a heatmap, along with the distribution of the values for each type. (D) Differences in assay values for ten representative protein targets, illustrated by a heatmap of their pairwise t-test p-values, and their distribution.
Figure 4: (a) Pearson and Spearman correlations for various label types, calculated both before and after grouping by PDB IDs. (b) Pearson correlations after grouping PDB IDs for different assay types trained on different datasets.
Figure 5: Visualization of chemical structure differences among small molecules from the top four assay types using t-SNE with ECFP fingerprints.
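
The captions for Figures 1 and 2 describe poses from three docking programs being compared by RMSD and filtered by voting. The sketch below illustrates one way such a filter could work; it is not the authors' implementation. The 2.0 threshold, the rule that a pose needs agreement from at least two programs, the function names, and the toy coordinates are all assumptions made for illustration. RMSD is computed directly on matching atoms without superposition or symmetry correction, on the assumption that all poses share the same receptor frame and atom ordering.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    Atoms are assumed to be listed in the same order; no alignment is done.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def consensus_poses(poses, threshold=2.0, min_votes=2):
    """Return the programs whose pose is supported by at least `min_votes` programs.

    `poses` maps a program name to its pose. Each pose votes for itself, and
    gains one vote for every other pose within `threshold` RMSD of it.
    Both `threshold` and `min_votes` are hypothetical defaults.
    """
    votes = {name: 1 for name in poses}
    for a, b in combinations(poses, 2):
        if rmsd(poses[a], poses[b]) <= threshold:
            votes[a] += 1
            votes[b] += 1
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Toy two-atom ligand: programs A and B roughly agree, program C is far away.
example_poses = {
    "program_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "program_b": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    "program_c": [(0.0, 0.0, 8.0), (1.0, 0.0, 8.0)],
}

kept = consensus_poses(example_poses)
print(kept)  # ['program_a', 'program_b']
```

Raising the threshold keeps more molecules but admits less consistent poses, which is the kind of trade-off between success and retention ratios that Figure 3(B) examines.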
