PyRelationAL: a python library for active learning research and development

Paul Scherer, Alison Pouplin, Alice Del Vecchio, Suraj M S, Oliver Bolton, Jyothish Soman, Jake P. Taylor-King, Lindsay Edwards, Thomas Gaudelet

TL;DR

PyRelationAL is a modular toolkit built around a two-step design methodology for composing pool-based active learning strategies, covering both single-acquisition and batch-acquisition settings. It supports the mathematical and practical specification of a broad range of existing and novel strategies under a consistent programming model and abstraction.

Abstract

Active learning (AL) is a sub-field of ML focused on methods that iteratively and economically acquire data by strategically querying the new data points that are most useful for a particular task. Here, we introduce PyRelationAL, an open-source library for AL research. We describe a modular toolkit built around a two-step design methodology for composing pool-based active learning strategies, applicable to both single-acquisition and batch-acquisition settings. This framework allows for the mathematical and practical specification of a broad range of existing and novel strategies under a consistent programming model and abstraction. Furthermore, we incorporate datasets, and active learning tasks applicable to them, to simplify comparative evaluation and benchmarking, along with an initial group of benchmarks across the datasets included in this library. The toolkit is compatible with existing ML frameworks. PyRelationAL is maintained using modern software engineering practices, with an inclusive contributor code of conduct, to promote long-term library quality and utilisation. PyRelationAL is available under a permissive Apache licence on PyPI and at https://github.com/RelationRx/pyrelational.
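To make the idea of composing a pool-based strategy from two steps concrete, here is a minimal sketch in plain NumPy. It assumes the two steps are (1) scoring every unlabelled pool point and (2) selecting which points to query from those scores; the abstract does not spell the steps out, so this decomposition is an assumption. The function names `entropy_scores` and `select_top_k` are hypothetical and are not PyRelationAL's API.

```python
import numpy as np


def entropy_scores(probs):
    """Step 1 (assumed): score each pool point by predictive entropy."""
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(probs * np.log(probs)).sum(axis=1)


def select_top_k(scores, k=1):
    """Step 2 (assumed): return indices of the k highest-scoring pool points."""
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:k].tolist()


# Hypothetical class probabilities predicted for four unlabelled pool points.
pool_probs = np.array([
    [0.98, 0.02],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
])

scores = entropy_scores(pool_probs)
single = select_top_k(scores, k=1)  # single-acquisition: query one point
batch = select_top_k(scores, k=2)   # batch-acquisition: query two points
```

Under this sketch, the same scoring step serves both acquisition modes; only the selection step changes, which is one way a single abstraction could cover single and batch strategies.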

Paper Structure (36 sections, 9 equations, 2 figures, 3 tables)


Figures (2)

  • Figure 1: Diagram of PyRelationAL's modular approach to constructing full active learning pipelines.
  • Figure 2: Four samples of benchmark results obtained for regression (top row) and classification (bottom row) scenarios. Note the different initialisations, single or batch-mode acquisition, and task models used across the strategies. (A) SynthReg2 dataset, regression with MLP ensemble with bagging, warm start, single acquisition. (B) Energy, regression with GP, cold start, single acquisition. (C) CreditCard, binary classification with random forest, warm start, single acquisition. (D) MNIST, multi-class classification with CNN+MCDropOut, warm start, Top-K batch acquisition with K=10. In each panel, the legend reports the area under the curve (AUC) as a global metric. Whether a high or a low AUC is preferable depends on the metric: for regression, where performance is measured by mean squared error, a lower AUC is better.
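
The area under a learning curve mentioned in the Figure 2 caption can be illustrated with a short sketch. The caption does not say how the area is computed, so the trapezoidal rule used here is an assumption, and the function name `area_under_curve` and the MSE values are hypothetical, not benchmark results.

```python
import numpy as np


def area_under_curve(x, y):
    """Trapezoidal area under a learning curve (assumed integration rule)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


# Hypothetical test MSE after each query iteration for two strategies.
iterations = [0, 1, 2, 3]
mse_a = [1.0, 0.6, 0.4, 0.3]  # error drops quickly
mse_b = [1.0, 0.9, 0.7, 0.3]  # error drops slowly, same final value

auc_a = area_under_curve(iterations, mse_a)
auc_b = area_under_curve(iterations, mse_b)
```

Both hypothetical strategies end at the same error, but the first reaches low error sooner and so has the smaller area. This is why, for an error metric such as mean squared error, a lower AUC indicates the better strategy.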