Classification Tree-based Active Learning: A Wrapper Approach

Ashna Jose, Emilie Devijver, Massih-Reza Amini, Noel Jakse, Roberta Poloni

TL;DR

This paper addresses the challenge of labeling cost in multi-class classification by introducing CT-AL, a wrapper active learning method that uses a classification tree to partition the input-output space into homogeneous leaves. New labels are selected from leaves based on purity and density, with entropy guiding impurity handling, and a diversity-representativeness criterion refining within-leaf samples. Empirical results on six benchmark datasets show that CT-AL, particularly the div-rep variant, consistently outperforms random sampling and several state-of-the-art AL methods, especially in imbalanced and multi-class settings, while maintaining low variance. The approach offers a practical, scalable strategy for obtaining high-accuracy models from very small labeled sets and points to future enhancements via semi-supervised learning, transfer learning, ensembles, and robustness to noise.

Abstract

Supervised machine learning often requires large training sets to train accurate models, yet obtaining large amounts of labeled data is not always feasible. Hence, it becomes crucial to explore active learning methods that reduce the size of training sets while maintaining high accuracy. The aim is to select the optimal subset of data for labeling from an initial unlabeled set, so that outcomes are predicted accurately. However, conventional active learning approaches perform comparably to classical random sampling. This paper proposes a wrapper active learning method for classification that organizes the sampling process into a tree structure and improves on state-of-the-art algorithms. A classification tree constructed on an initial set of labeled samples is used to decompose the space into low-entropy regions. Input-space based criteria are then used to sub-sample from these regions, with the total number of points to be labeled distributed across the regions. This adaptation proves to be a significant enhancement over existing active learning methods. Through experiments on various benchmark data sets, the paper demonstrates that the proposed framework constructs accurate classification models even when provided with a severely restricted labeled data set.
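
The allocation idea in the abstract (a tree fitted on the initial labeled set defines regions, and the labeling budget is split across those regions) can be illustrated with a minimal sketch. It assumes leaf indices are already available for every sample, for instance from scikit-learn's `DecisionTreeClassifier.apply`. The function names and the weighting `density * (1 + entropy)` are assumptions made for illustration only; they are not the allocation rule defined in the paper.

```python
import math
from collections import Counter


def leaf_entropy(labels):
    """Shannon entropy (in bits) of the class labels observed in one leaf."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate_budget(labeled_leaves, labels, unlabeled_leaves, n_act):
    """Split n_act label queries across leaves.

    Hypothetical weighting (NOT the paper's formula): each leaf is weighted
    by its share of unlabeled points times (1 + entropy of its known labels),
    so dense and impure leaves receive more queries.
    """
    # Group the known labels by the leaf they fall into.
    labels_by_leaf = {}
    for leaf, y in zip(labeled_leaves, labels):
        labels_by_leaf.setdefault(leaf, []).append(y)

    # Density: number of unlabeled points per leaf.
    density = Counter(unlabeled_leaves)
    n_unlabeled = len(unlabeled_leaves)

    weights = {}
    for leaf, count in density.items():
        known = labels_by_leaf.get(leaf, [])
        h = leaf_entropy(known) if known else 0.0
        weights[leaf] = (count / n_unlabeled) * (1.0 + h)

    # Largest-remainder rounding so the allocations sum exactly to n_act.
    total_weight = sum(weights.values())
    shares = {leaf: n_act * w / total_weight for leaf, w in weights.items()}
    alloc = {leaf: int(s) for leaf, s in shares.items()}
    leftover = n_act - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda l: shares[l] - alloc[l], reverse=True)
    for leaf in by_remainder[:leftover]:
        alloc[leaf] += 1
    return alloc


# Leaf 0 is pure (only class 1), leaf 1 is impure (classes 1 and 2).
alloc = allocate_budget(
    labeled_leaves=[0, 0, 1, 1],
    labels=[1, 1, 1, 2],
    unlabeled_leaves=[0] * 5 + [1] * 5,
    n_act=6,
)
```

With equal densities in both leaves, the impure leaf receives the larger share of the budget under this weighting. The sketch does not cap a leaf's allocation at its number of unlabeled points, which a full implementation would need to handle.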

Paper Structure (16 sections, 12 equations, 3 figures, 2 tables, 2 algorithms)

Figures (3)

  • Figure 1: Flowchart of the proposed method. Blue circles correspond to model-free steps, while magenta dashed rectangles correspond to model-based steps. The yellow pentagon defines the criteria used to query new samples, which are detailed in Section \ref{sec:ct-al}. The whole budget $n$ is divided into the initial points, $n_{\text{init}}$, and the new points to be labeled, $n_{\text{act}}$.
  • Figure 2: A classification tree learnt on labeled samples, illustrating the different kinds of nodes. Here, we consider the binary case, with classes 1 and 2. Leaves are classified as 'pure' when all the true labels in a region belong to the same class, as depicted in green, and as 'impure' when the labels belong to different classes, shown in blue. True class labels are shown as black numbers, while unlabeled samples are shown as red numbers. The dark and light shades represent the high and low density regions of unlabeled samples, respectively. The $n_k^*$ below each leaf is the number of new samples to be labeled from that region using CT-AL.
  • Figure 3: Prediction performance measured by balanced accuracy score, averaged over 100 runs with different train-test splits, when the training set is constructed using random sampling (RS) or CT-AL with different criteria for sampling from the leaf regions, for 6 different data sets. CT-AL with random sampling from the leaves is shown as CT-AL (RS), while CT-AL with the diversity-representativity criterion is shown as CT-AL (div-rep). The training set size varies from 20 to 200.
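
Figure 3 reports balanced accuracy rather than plain accuracy. As a minimal sketch of that metric, assuming the standard definition (the mean of per-class recalls), the function below shows why it suits imbalanced data sets: a model that always predicts the majority class scores high on plain accuracy but only 0.5 on balanced accuracy in a two-class problem. The function name and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, computed over the classes present in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy imbalanced case: 9 samples of class 0, 1 sample of class 1,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Here `plain` is 0.9 while `balanced` is 0.5, since the recall on class 1 is zero. Averaging this score over repeated train-test splits, as the caption describes, would simply mean calling the function once per run and taking the mean.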