Instructor-inspired Machine Learning for Robust Molecular Property Prediction
Fang Wu, Shuting Jin, Siyuan Li, Stan Z. Li
TL;DR
This work tackles data scarcity in molecular property prediction by introducing InstructMol, a semi-supervised framework in which an instructor model estimates the reliability of pseudo-labels and steers a target model trained on abundant unlabeled data, without cross-domain transfer. The method pairs a confidence-prediction task for the instructor with a cost-sensitive, sample-weighted objective for the predictor, so that unlabeled data can be used effectively despite distribution shifts. On MoleculeNet benchmarks and GOOD out-of-distribution (OOD) datasets, InstructMol achieves state-of-the-art or near-state-of-the-art results, with notable improvements in-domain and especially out-of-domain, and it demonstrates practical utility in real-world drug-discovery cases. The approach is also compatible with existing pretraining paradigms, suggesting a versatile pathway to robust molecular property prediction when labeled data is limited.
Abstract
Machine learning is catalyzing a revolution in the chemical and biological sciences. However, its efficacy depends heavily on the availability of labeled data, and annotating biochemical data is extremely laborious. To surmount this data-sparsity challenge, we present an instructive learning algorithm named InstructMol that measures the reliability of pseudo-labels and helps the target model leverage large-scale unlabeled data. InstructMol does not require transferring knowledge between multiple domains, which avoids the potential gap between the pretraining and fine-tuning stages. We demonstrate the high accuracy of InstructMol on several real-world molecular datasets and out-of-distribution (OOD) benchmarks. Code is available at https://github.com/smiles724/InstructMol.
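As a rough illustration of the general idea of weighting pseudo-labels by an estimated reliability, the following minimal sketch computes a confidence-weighted binary cross-entropy over labeled and pseudo-labeled samples. It is not the objective used by InstructMol: the function names `bce` and `weighted_loss`, the tuple layout of the inputs, and the choice to normalize by the sum of weights are all assumptions made for this example, and the confidence values stand in for whatever an instructor model would output.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for a single predicted probability."""
    # Clamp the probability so that log() never receives 0.
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def weighted_loss(labeled, pseudo_labeled):
    """Confidence-weighted average loss (hypothetical helper).

    labeled:        list of (pred_prob, true_label); each has weight 1.
    pseudo_labeled: list of (pred_prob, pseudo_label, confidence), where
                    confidence in [0, 1] is an assumed instructor output.
    """
    total = 0.0
    weight_sum = 0.0
    for prob, label in labeled:
        total += bce(prob, label)
        weight_sum += 1.0
    for prob, pseudo, conf in pseudo_labeled:
        # A pseudo-label judged unreliable contributes little to the loss.
        total += conf * bce(prob, pseudo)
        weight_sum += conf
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    labeled = [(0.9, 1), (0.2, 0)]
    pseudo = [(0.3, 1, 0.1), (0.8, 1, 0.9)]
    print(weighted_loss(labeled, pseudo))
```

In this sketch a pseudo-label with confidence 0 has no effect on the loss, while one with confidence 1 counts as much as a ground-truth label.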
