Transfer Learning for Molecular Property Predictions from Small Data Sets

Thorren Kirschbaum, Annika Bande

TL;DR

This study benchmarks common machine learning models for predicting molecular properties on two small data sets. For the HOPV data set, the final training results do not improve monotonically with the size of the pre-training data set; instead, pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.

Abstract

Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small data sets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small data sets. The best results are obtained with the message passing neural network PaiNN and with SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that pre-trains the respective models on large data sets and yields more accurate models after fine-tuning on the original data sets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both the pre-training and the fine-tuning data sets are normalized to mean zero and standard deviation one to align the labels' distributions. This study covers two small chemistry data sets: the Harvard Organic Photovoltaics data set (HOPV, HOMO-LUMO gaps), for which excellent results are obtained, and the Freesolv data set (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV data set, the final training results do not improve monotonically with the size of the pre-training data set; instead, pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
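The label normalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each data set is standardized independently with its own mean and standard deviation, which is one plausible reading of the abstract, not a confirmed detail of the authors' pipeline. The function names and the toy label values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(labels):
    """Return z-scored labels together with the mean and std that were used.

    Assumption: the population standard deviation is used; the source
    does not say which estimator the authors chose.
    """
    mean = fmean(labels)
    std = pstdev(labels)
    if std == 0:
        raise ValueError("labels are constant; cannot standardize")
    return [(y - mean) / std for y in labels], mean, std


def destandardize(z_values, mean, std):
    """Map normalized predictions back to the original label scale."""
    return [z * std + mean for z in z_values]


# Hypothetical toy labels (not taken from the paper).
pretrain_labels = [3.1, 4.0, 5.2, 2.7, 3.9]   # e.g. cheap-method gaps
finetune_labels = [1.8, 2.2, 2.9, 1.5]        # e.g. reference gaps

# Each data set is normalized with its own statistics, so both label
# distributions end up with mean zero and standard deviation one.
pre_z, pre_mean, pre_std = standardize(pretrain_labels)
fine_z, fine_mean, fine_std = standardize(finetune_labels)
```

Predictions of a fine-tuned model would then be converted back with `destandardize(predictions, fine_mean, fine_std)`, so that errors are reported in the units of the original data set.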

Paper Structure (7 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Scatter plots of the original data against data obtained from XTB (left) or LDA (right), with linear regression fits, for the data sets HOPV (HOMO-LUMO gaps, top) and Freesolv (solvation energies, bottom).
  • Figure 2: Fine-tuning learning curves for PaiNN training on HOPV (top) and Freesolv (bottom), after training from scratch (black) or after pre-training on OE62 or QM9 data, respectively, with labels obtained from XTB (red) and LDA-DFT (blue). The MAE (mean and standard deviation over five runs) is plotted against the number of training examples used for fine-tuning (log scale).
  • Figure 3: Pre-training learning curves for PaiNN training on HOPV (top) and Freesolv (bottom), after training from scratch (black star) or after pre-training on OE62 or QM9 data, respectively, with labels obtained from XTB (red) and LDA-DFT (blue). The MAE (mean and standard deviation over five runs) is plotted against the number of training examples used for pre-training.
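
The learning curves in Figures 2 and 3 report the MAE as a mean and standard deviation over five runs. A minimal sketch of that aggregation follows. The helper names and the per-run MAE values are hypothetical, and the use of the sample standard deviation is an assumption, since the captions do not specify the estimator.

```python
from statistics import fmean, stdev


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between reference and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def summarize_runs(maes):
    """Return (mean, sample standard deviation) of per-run MAE values."""
    return fmean(maes), stdev(maes)


# Hypothetical MAEs from five independent training runs at one
# training-set size; these are not results from the paper.
run_maes = [0.10, 0.20, 0.30, 0.40, 0.50]
mae_mean, mae_std = summarize_runs(run_maes)
```

Repeating this for each training-set size gives one point with an error bar per size, which is the form of the curves described in the captions.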