Classification Tree-based Active Learning: A Wrapper Approach
Ashna Jose, Emilie Devijver, Massih-Reza Amini, Noel Jakse, Roberta Poloni
TL;DR
This paper addresses the challenge of labeling cost in multi-class classification by introducing CT-AL, a wrapper active learning method that uses a classification tree to partition the input-output space into homogeneous leaves. New labels are selected from leaves based on purity and density, with entropy guiding impurity handling, and a diversity-representativeness criterion refining within-leaf samples. Empirical results on six benchmark datasets show that CT-AL, particularly the div-rep variant, consistently outperforms random sampling and several state-of-the-art AL methods, especially in imbalanced and multi-class settings, while maintaining low variance. The approach offers a practical, scalable strategy for obtaining high-accuracy models from very small labeled sets and points to future enhancements via semi-supervised learning, transfer learning, ensembles, and robustness to noise.
Abstract
Supervised machine learning often requires large training sets to produce accurate models, yet obtaining large amounts of labeled data is not always feasible. It is therefore crucial to explore active learning methods that reduce the size of the training set while maintaining high accuracy. The aim is to select, from an initial unlabeled set, the subset of data whose labels are most useful for accurate prediction. In practice, however, conventional active learning approaches frequently perform only comparably to classical random sampling. This paper proposes a wrapper active learning method for classification that organizes the sampling process into a tree structure and improves on state-of-the-art algorithms. A classification tree built on an initial set of labeled samples decomposes the input space into low-entropy regions. Input-space based criteria are then used to sub-sample from these regions, with the total number of points to be labeled divided among the regions. This adaptation proves to be a significant enhancement over existing active learning methods. Experiments on various benchmark data sets show that the proposed framework constructs accurate classification models even when the labeled data set is severely restricted.
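To make the idea of dividing the labeling budget among tree regions concrete, here is a minimal Python sketch. It is not the paper's algorithm: the function names `leaf_entropy` and `allocate_budget`, the smoothing constant `eps`, and the weighting rule (number of unlabeled points in a leaf multiplied by the leaf's label entropy plus `eps`) are all assumptions chosen for illustration. The sketch only shows how impure, densely populated leaves could receive a larger share of new labels.

```python
import math
from collections import Counter


def leaf_entropy(labels):
    """Shannon entropy (in bits) of the labels observed in one leaf."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def allocate_budget(leaves, budget, eps=0.1):
    """Split `budget` new labels across leaves (hypothetical rule).

    `leaves` maps a leaf id to (labels_in_leaf, n_unlabeled_in_leaf).
    Weight = n_unlabeled * (entropy + eps); this weighting is an
    assumption for illustration, not the rule used in the paper.
    """
    weights = {
        k: n_unl * (leaf_entropy(labs) + eps)
        for k, (labs, n_unl) in leaves.items()
    }
    total = sum(weights.values())
    if total == 0 or budget <= 0:
        return {k: 0 for k in leaves}

    # Proportional share, rounded down and capped by the leaf's pool size.
    raw = {k: budget * w / total for k, w in weights.items()}
    alloc = {k: min(int(raw[k]), leaves[k][1]) for k in leaves}

    # Hand out the remaining labels by largest fractional remainder,
    # never requesting more points than a leaf actually contains.
    leftover = budget - sum(alloc.values())
    order = sorted(leaves, key=lambda k: raw[k] - int(raw[k]), reverse=True)
    while leftover > 0:
        progressed = False
        for k in order:
            if leftover == 0:
                break
            if alloc[k] < leaves[k][1]:
                alloc[k] += 1
                leftover -= 1
                progressed = True
        if not progressed:  # every leaf is exhausted
            break
    return alloc


# Toy example: leaf "a" is pure, leaf "b" is mixed, leaf "c" has no pool.
example = {
    "a": (["x", "x", "x", "x"], 50),
    "b": (["x", "y", "x", "y"], 50),
    "c": (["y"], 0),
}
allocation = allocate_budget(example, budget=10)
```

In the toy example the mixed leaf receives most of the budget and the leaf with no unlabeled points receives none. In a full pipeline the leaves would come from a classification tree fitted on the current labeled set, and the points inside each leaf would then be chosen with an input-space criterion such as the diversity-representativeness rule mentioned in the TL;DR.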
