Active Learning of Molecular Data for Task-Specific Objectives

Kunal Ghosh, Milica Todorović, Aki Vehtari, Patrick Rinke

TL;DR

The performance of AL depends on how the distribution of the target molecules relates to the distribution of the full dataset: the computational savings are largest when the two overlap least.

Abstract

Active learning (AL) has shown promise as a particularly data-efficient machine learning approach. Yet its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as the molecular representation. For the first task, we tested different data acquisition strategies, batch sizes and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.

Paper Structure

This paper contains 7 sections, 8 equations, and 9 figures.

Figures (9)

  • Figure 1: Traditionally, materials datasets are curated combinatorially, using human intuition. We propose an AI-assisted dataset curation scheme
  • Figure 2: Illustration of the AL workflow: a) the active learning iteration and b) the evolution of the held-out, training and test set sizes. Before performing active learning, small, labeled training and test sets are compiled. A Gaussian process regression (GPR) model is fitted to the training set and then used to obtain property predictions for the unlabeled held-out set. The acquisition strategy (AS) combines the predicted property, the corresponding prediction uncertainty and the molecular representation to select molecules from the held-out set. Selected molecules are then labeled using ab initio simulation software (DFT) and added to the training set. The larger training set is used to train a new GPR model, and the iteration continues.
  • Figure 3: Illustration of active learning acquisition strategies: A) random; B) utilizing GPR prediction uncertainty; C) by clustering molecular representations; D) by first selecting molecules with high GPR prediction uncertainty and then clustering the selected molecules, selecting the cluster centers; E) by first selecting the set of molecules whose GPR-predicted property lies within a property value range and then making a random selection from that set. Yellow circles indicate molecules. Circles with a thick red border indicate selected molecules. Dashed lines separate groups of molecules; red dashed lines indicate clusters of molecules. The red dot inside a cluster indicates the cluster center, and the arrow points to the molecule closest to the cluster center.
  • Figure 4: AL learning curves for Task 1, with test set MAEs computed from GP model predictions as a function of increasing training set size. a) Performance of different AS for the AA dataset with the POW batch scheme. b) Performance of different batch strategies for the AA dataset and AS D. c) Strategy A and D performance on all datasets with $\sigma_n^{2} = 10^{-10}$. d) Strategy A and D performance on all datasets with $\sigma_n^{2} = 0.05$.
  • Figure 5: AL model performance for Task 2. a) Number of correct structures (HOMO > $\varepsilon$) identified by AS A and E, presented as a percentage of the total in-range molecules in the dataset. b) As seen in panel a), AS E requires fewer training examples to achieve the same predictive accuracy as AS A. The plot presents the number of additional in-range molecules identified by AS E relative to AS A, expressed as a percentage of total in-range molecules in each dataset.
  • ...and 4 more figures
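
The loop in Figure 2 and acquisition strategies D and E from Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: plain feature vectors stand in for the many-body tensor, "labeling" is a lookup in a precomputed array rather than a DFT calculation, and the kernel, length scale, noise variance, candidate-pool size and every function name (`rbf_kernel`, `gp_predict`, `select_uncertain_then_cluster`, `select_in_range`, `active_learning`) are hypothetical choices made for this example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel with unit prior variance (assumed, not from the paper).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_train, y_train, x_query, noise_var=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - (v**2).sum(0)  # prior variance of the RBF kernel is 1
    return k_qt @ alpha, np.sqrt(np.clip(var, 0.0, None))

def select_uncertain_then_cluster(x_pool, std, batch_size, n_candidates, rng, n_iter=20):
    """Strategy D sketch: keep the most uncertain candidates, cluster them
    with a few k-means steps, and return the pool index nearest each center."""
    cand = np.argsort(std)[-n_candidates:]
    pts = x_pool[cand]
    centers = pts[rng.choice(len(pts), size=batch_size, replace=False)]
    for _ in range(n_iter):
        labels = ((pts[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(batch_size):
            if np.any(labels == k):
                centers[k] = pts[labels == k].mean(0)
    nearest = ((pts[:, None] - centers[None]) ** 2).sum(-1).argmin(0)
    # Two centers may share a nearest point, so the batch can be smaller.
    return np.unique(cand[nearest])

def select_in_range(mean, threshold, batch_size, rng):
    """Strategy E sketch: random draw among points predicted above a threshold."""
    in_range = np.flatnonzero(mean > threshold)
    if len(in_range) == 0:  # fallback is an assumption of this sketch
        in_range = np.arange(len(mean))
    return rng.choice(in_range, size=min(batch_size, len(in_range)), replace=False)

def active_learning(x, y, n_init=10, batch_size=5, n_rounds=5, seed=0):
    """Grow the training set from the held-out pool using strategy D."""
    rng = np.random.default_rng(seed)
    train = rng.choice(len(x), size=n_init, replace=False)
    pool = np.setdiff1d(np.arange(len(x)), train)
    for _ in range(n_rounds):
        _, std = gp_predict(x[train], y[train], x[pool])
        picked = select_uncertain_then_cluster(
            x[pool], std, batch_size, 4 * batch_size, rng
        )
        train = np.concatenate([train, pool[picked]])  # "label" = look up y
        pool = np.delete(pool, picked)
    return train, pool
```

Calling `active_learning(x, y)` on an array of feature vectors `x` and property values `y` returns the indices of the selected training points and of the remaining held-out pool. Swapping the call to `select_uncertain_then_cluster` for `select_in_range` applied to the predicted mean would turn the same loop into a targeted search in the spirit of strategy E.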