regAL: Python Package for Active Learning of Regression Problems

Elizaveta Surzhikova, Jonny Proppe

TL;DR

This work presents the Python package regAL, which allows users to evaluate different active learning strategies for regression problems, and is intended for anyone who aims to perform and understand active learning in their problem-specific scope.

Abstract

A growing number of research areas rely on machine learning methods to accelerate discovery while saving resources. Machine learning models, however, usually require large datasets of experimental or computational results, which in certain fields, such as (bio)chemistry, materials science, or medicine, are rarely available and often prohibitively expensive to obtain. To bypass this obstacle, active learning methods are employed to develop machine learning models with a desired performance while requiring the fewest possible computational or experimental results from the domain of application. For this purpose, the model's knowledge about certain regions of the application domain is estimated to guide the choice of the model's training set. Although active learning is widely studied for classification problems (discrete outcomes), comparatively few works address it for regression problems (continuous outcomes). In this work, we present our Python package regAL, which allows users to evaluate different active learning strategies for regression problems. Requiring only the dataset in question as minimal input, while offering many additional customization and insight options, this package is intended for anyone who aims to perform and understand active learning in their problem-specific scope.

Paper Structure

This paper contains 11 sections, 8 equations, 10 figures.

Figures (10)

  • Figure 1: Exemplary spaces with data samples. On the left in green, samples which contribute highly to the model's understanding of the space are shown. On the right, redundant sample pairs, of which the second sample provides negligible additional information to the model after the first sample is labeled, are shown in yellow.
  • Figure 2: Representation of all sample selection methods currently implemented in regAL. The red stars constitute the training set of the model, while the dots are the unknown pool. The color of the dots indicates their priority to be labeled, with yellow and blue representing the highest and lowest priorities, respectively. Note that this ranking is only valid for this particular iteration and will change after the model is retrained.
  • Figure 3: Illustration of benchmark and learn mode procedures in regAL. In benchmark mode, the labels of all samples are known from the start, but are only incrementally revealed to the model. In learn mode, labels are known only for some samples, and new labels are generated by an oracle, which can be user inputs, rule-based decisions, results of experiments or computer simulations, etc.
  • Figure 4: Visualization of the QM9 subset used for the performance example (see also Fig. 5). A total of 1340 points were uniformly sampled from the QM9 dataset, of which 40 were randomly chosen for the initial training set, shown by red stars.
  • Figure 5: Exemplary regAL output in benchmark mode, showing the progression of the root mean squared error (RMSE) over the number of active learning cycles for a subset of the QM9 dataset (target: HOMO energy ($E_\text{HOMO}$), features: invariant two-body interactions descriptor $F_\text{2B}$). Uncertainty sampling is abbreviated as UNC, covariance sampling as COV, and random sampling as RND.
  • ...and 5 more figures
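
The benchmark-mode procedure described in the captions of Figures 3 and 5 (all labels known from the start, revealed one at a time, RMSE tracked per cycle) can be illustrated with a minimal, self-contained sketch. This is not regAL's API: the functions `rbf_kernel`, `gp_predict`, and `benchmark`, the NumPy Gaussian-process surrogate, the synthetic sine dataset, and the choice to evaluate the RMSE on the remaining pool are all assumptions made for illustration. Only the strategy labels UNC and RND are taken from the Figure 5 caption.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP (illustrative surrogate)."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # Prior variance is 1 for this kernel; subtract the explained part.
    var = 1.0 - np.einsum("ij,ji->i", k_qt, np.linalg.solve(k_tt, k_qt.T))
    return mean, np.clip(var, 0.0, None)

def benchmark(x, y, n_init=5, n_cycles=20, strategy="UNC", seed=0):
    """Hypothetical benchmark-mode loop: returns the pool RMSE per cycle."""
    rng = np.random.default_rng(seed)
    train = [int(i) for i in rng.choice(len(x), size=n_init, replace=False)]
    pool = [i for i in range(len(x)) if i not in train]
    rmse = []
    for _ in range(n_cycles):
        mean, var = gp_predict(x[train], y[train], x[pool])
        rmse.append(float(np.sqrt(np.mean((mean - y[pool]) ** 2))))
        if strategy == "UNC":
            pick = int(np.argmax(var))           # most uncertain pool sample
        else:
            pick = int(rng.integers(len(pool)))  # RND: uniform random choice
        # "Reveal" the already-known label by moving the sample to the training set.
        train.append(pool.pop(pick))
    return rmse

# Synthetic stand-in for a fully labeled dataset (assumption, not QM9).
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(x[:, 0])

rmse_unc = benchmark(x, y, strategy="UNC")
rmse_rnd = benchmark(x, y, strategy="RND")
```

Plotting `rmse_unc` and `rmse_rnd` against the cycle index gives a learning-curve comparison of the kind the Figure 5 caption describes. In learn mode, the line that moves a sample from the pool to the training set would instead query an oracle for the missing label.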