MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis

Shikun Feng, Jiaxin Zheng, Yinjun Jia, Yanwen Huang, Fengfeng Zhou, Wei-Ying Ma, Yanyan Lan

TL;DR

This work constructs a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline.

Abstract

Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset's properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress in the field of artificial intelligence-driven drug discovery.

Paper Structure

This paper contains 33 sections, 3 figures, 12 tables.

Figures (3)

  • Figure 1: Statistical analysis of dataset sizes and label distributions for the tasks in MoleculeNet. (a) shows that most task datasets contain fewer than 10,000 entries. (b) shows the label distribution across all subtasks within each classification task. The proportion of samples with a label value of 1 is skewed towards either extreme, indicating a significant imbalance in MoleculeNet's label distribution.
  • Figure 2: Overview of MoleculeCLA: diverse categories of molecular properties are derived from the computational binding analysis. We assess methods such as deep learning models and descriptors through linear probe, MLP, and fine-tuning testing protocols. Results are reported using multiple regression task metrics.
  • Figure 3: Data analysis of MoleculeCLA: (a) The t-SNE visualization of fingerprint clustering across various datasets, including MoleculeCLA, PCBA, MoleculeACE, KIBA, Davis, and LBA, shows that MoleculeCLA covers a chemical space comparable to PCBA despite containing approximately one-third as many samples. (b) The Pearson correlation matrix among tasks within MoleculeCLA shows the diversity of the different properties. (c) The label value distributions across all nine tasks are smooth for most tasks, with the exception of esite and hbond.
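
The linear probe protocol mentioned in Figure 2 can be illustrated with a minimal sketch: a linear regressor is fitted on frozen molecular features and scored with regression metrics. This is not the authors' implementation. The features and labels below are synthetic stand-ins, the closed-form ridge solver and the names `fit_linear_probe` and `regression_metrics` are hypothetical, and the choice of RMSE, MAE, R² and Pearson correlation is an assumption, since the captions above only say "multiple regression task metrics".

```python
import numpy as np


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a ridge-regularised linear map on frozen features (closed form).

    Hypothetical helper for illustration; returns (weights, bias).
    """
    mean_x = features.mean(axis=0)
    mean_y = targets.mean()
    centered_x = features - mean_x
    centered_y = targets - mean_y
    # Normal equations with an L2 penalty on the weights only.
    gram = centered_x.T @ centered_x + l2 * np.eye(features.shape[1])
    weights = np.linalg.solve(gram, centered_x.T @ centered_y)
    bias = mean_y - mean_x @ weights
    return weights, bias


def regression_metrics(y_true, y_pred):
    """Common regression metrics (an assumed set, not taken from the paper)."""
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": float(np.corrcoef(y_true, y_pred)[0, 1]),
    }


# Synthetic stand-in for frozen embeddings and one regression task label.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))
true_weights = rng.normal(size=16)
targets = features @ true_weights + 0.1 * rng.normal(size=500)

# Simple train/test split; the encoder producing `features` stays frozen.
train, test = slice(0, 400), slice(400, 500)
weights, bias = fit_linear_probe(features[train], targets[train])
predictions = features[test] @ weights + bias
metrics = regression_metrics(targets[test], predictions)
print(metrics)
```

Because only the linear map is trained, the scores reflect how much of a property is linearly recoverable from the representation itself, which is why such a probe is a common complement to MLP and fine-tuning evaluations.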