
Instructor-inspired Machine Learning for Robust Molecular Property Prediction

Fang Wu, Shuting Jin, Siyuan Li, Stan Z. Li

TL;DR

This work tackles data scarcity in molecular property prediction by introducing InstructMol, a semi-supervised framework that deploys an instructor to estimate pseudo-label reliability and steer a target model using abundant unlabeled data without cross-domain transfer. The method combines a confidence-prediction task for the instructor with a cost-sensitive, sample-weighted objective for the predictor, enabling effective use of unlabeled data despite distribution shifts. Across MoleculeNet benchmarks and GOOD OOD datasets, InstructMol achieves state-of-the-art or near-SOTA results, with notable improvements in both in-domain and especially out-of-domain settings, and demonstrates practical utility in real-world drug-discovery cases. The approach also shows strong compatibility with existing pretraining paradigms, suggesting a versatile pathway to robust molecular property prediction under limited labeled data.

Abstract

Machine learning catalyzes a revolution in chemical and biological science. However, its efficacy heavily depends on the availability of labeled data, and annotating biochemical data is extremely laborious. To surmount this data sparsity challenge, we present an instructive learning algorithm named InstructMol to measure pseudo-labels' reliability and help the target model leverage large-scale unlabeled data. InstructMol does not require transferring knowledge between multiple domains, which avoids the potential gap between the pretraining and fine-tuning stages. We demonstrated the high accuracy of InstructMol on several real-world molecular datasets and out-of-distribution (OOD) benchmarks. Code is available at https://github.com/smiles724/InstructMol.
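
To make the idea of reliability-weighted pseudo-labels concrete, the following is a minimal sketch, not the authors' objective. It assumes a binary classification task, and it assumes that an instructor has already assigned each pseudo-label a confidence score in [0, 1]; the function names `bce` and `confidence_weighted_loss` and the simple multiplicative weighting are illustrative choices. The loss actually used by InstructMol may differ and is defined in the paper and the linked repository.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def confidence_weighted_loss(labeled, pseudo_labeled):
    """Mean loss over labeled and pseudo-labeled samples.

    labeled:        list of (prob, label) pairs with ground-truth labels,
                    each counted with weight 1.
    pseudo_labeled: list of (prob, pseudo_label, confidence) triples, where
                    confidence is a hypothetical instructor score in [0, 1]
                    that scales how much the sample contributes.
    """
    total = sum(bce(p, y) for p, y in labeled)
    total += sum(c * bce(p, y) for p, y, c in pseudo_labeled)
    n = len(labeled) + len(pseudo_labeled)
    return total / n if n else 0.0


# One labeled molecule and two pseudo-labeled ones; the second pseudo-label
# has confidence 0, so it does not contribute to the loss.
loss = confidence_weighted_loss(
    labeled=[(0.9, 1)],
    pseudo_labeled=[(0.8, 1, 0.5), (0.3, 0, 0.0)],
)
```

In this sketch, a pseudo-label the instructor fully trusts behaves like a ground-truth label, while one it distrusts is effectively ignored, which is one simple way to let unlabeled data influence training without treating every pseudo-label as correct.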

Paper Structure

This paper contains 34 sections, 2 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Four mainstream paradigms to ameliorate the scarcity of labeled biochemical data. (A) Self-supervised pretraining tasks include masked component modeling, contrastive learning, and auto-encoding. (B) Active learning iteratively selects the most informative data, namely the samples about which the molecular models are most uncertain. These samples are then subjected to laboratory testing to determine their labels, and the process is repeated with the newly labeled data added to the training set. (C) Knowledge graphs are introduced to provide structured relations among multiple drugs and unstructured semantic relations associated with different drug molecules. (D) In semi-supervised learning (SSL), the unlabeled data is used to create a smooth decision boundary between different classes or to estimate the distribution of the input data, while the labeled data is used to provide specific examples of the correct output.
  • Figure 2: The outline of InstructMol. We utilize a pre-trained target molecular model to predict the properties of unlabeled examples, and these predictions serve as pseudo-labels. Then, an instructor model predicts the confidence of those pseudo-labels, and these confidence scores guide the target molecular model to give different amounts of attention to different data points.
  • Figure 3: Scatter plots of the distributions of LogP predictions for unlabeled data with and without InstructMol. The first row shows predictions before instructive learning, and the second row shows predictions after instructive learning.
  • Figure 4: The influence of unlabeled data size on four tasks.
  • Figure 5: The distributions of confidence scores given by the instructor model during the training process.
  • ...and 3 more figures