ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data

Valentin Margraf, Marcel Wever, Sandra Gilhuber, Gabriel Marques Tavares, Thomas Seidl, Eyke Hüllermeier

TL;DR

This work proposes ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines, and includes built-in measures to ensure reproducible evaluations by saving exact dataset splits and the hyperparameter settings of the algorithms used.

Abstract

In settings where only a limited budget of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms' efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for combining query strategies with different learning algorithms into active learning pipelines and for examining the impact of the choice of learning algorithm. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure reproducible evaluations, saving exact dataset splits and the hyperparameter settings of the algorithms used. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings. We provide ALPBench here: https://github.com/ValentinMargraf/ActiveLearningPipelines.
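
To make the notion of an active learning pipeline concrete, the following is a minimal, self-contained sketch of one such pipeline: margin sampling as the query strategy paired with a random forest as the learner. It is written against plain scikit-learn rather than ALPBench's API, and the dataset, seed-set size, and batch size are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of an active learning pipeline in the paper's sense:
# a query strategy (margin sampling) paired with a learning algorithm
# (a random forest). Plain scikit-learn, NOT the ALPBench API.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic tabular classification task standing in for a benchmark dataset.
X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

# Start from a small labeled seed set; the rest of the pool is "unlabeled".
labeled = list(rng.choice(len(X_pool), size=30, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

batch_size, n_rounds = 10, 10
for round_ in range(n_rounds):
    learner = RandomForestClassifier(random_state=0)
    learner.fit(X_pool[labeled], y_pool[labeled])

    # Margin sampling: query the points whose top-two class probabilities
    # are closest, i.e. where the learner is least decided.
    proba = learner.predict_proba(X_pool[unlabeled])
    sorted_proba = np.sort(proba, axis=1)
    margins = sorted_proba[:, -1] - sorted_proba[:, -2]
    query = np.argsort(margins)[:batch_size]

    # "Label" the queried points (labels are already known here) and move
    # them from the unlabeled pool to the labeled set.
    chosen = [unlabeled[i] for i in query]
    labeled.extend(chosen)
    unlabeled = [i for i in unlabeled if i not in set(chosen)]

    acc = accuracy_score(y_test, learner.predict(X_test))
    print(f"round {round_}: {len(labeled)} labels, test accuracy {acc:.3f}")
```

Each round retrains the learner on the labeled set, queries the most ambiguous points, and records test accuracy; swapping the query strategy or the learner yields a different pipeline, which is exactly the combination space ALPBench is designed to benchmark.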

Paper Structure

This paper contains 24 sections, 3 equations, 15 figures, and 2 tables.

Figures (15)

  • Figure 1: The contributions of our paper are threefold: (i) the first active learning benchmark considering pipelines of query strategies and learning algorithms, (ii) an extensible Python package for applying and benchmarking active learning pipelines, and (iii) an extensive empirical evaluation of active learning pipelines.
  • Figure 2: Heatmaps for all active learning pipelines (ALPs) in our evaluation study, with statistical significance (first and second subfigures) and without (third and fourth). Information-based, representation-based, and hybrid query strategies are colored red, green, and blue, respectively; random sampling is purple.
  • Figure 3: Lose-Heatmaps for all active learning pipelines (ALPs) excluding TabNet, without statistical significance, considering binary and multiclass datasets. The color-coding is consistent with Figure 2.
  • Figure 4: Win-Matrices for different learners (SVM, CatBoost, and TabPFN) in the small setting, considering multiclass datasets, with statistical significance.
  • Figure 5: Budget curves for different active learning pipelines (ALPs) built from the learners RF, KNN, and XGBoost combined with the query strategies random sampling, margin sampling, CoreSet, and CluMS, on different datasets in the small setting (a sketch of how such curves are produced follows this list).
  • ...and 10 more figures
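
The per-round (number of labels, test accuracy) pairs produced by a loop like the one sketched after the abstract are exactly the points on a budget curve such as those in Figure 5. A hedged sketch of plotting one, with purely illustrative placeholder values:

```python
# Hedged sketch: plotting a budget curve (test accuracy vs. labeling
# budget) as in Figure 5. `history` stands in for the (n_labels, accuracy)
# pairs collected during an active learning loop; the values below are
# illustrative placeholders, not results from the paper.
import matplotlib.pyplot as plt

history = [(30, 0.61), (40, 0.66), (50, 0.70), (60, 0.72)]

budgets, accuracies = zip(*history)
plt.plot(budgets, accuracies, marker="o", label="RF + margin sampling")
plt.xlabel("number of labeled instances (budget)")
plt.ylabel("test accuracy")
plt.title("Budget curve for one active learning pipeline")
plt.legend()
plt.show()
```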