Diverse mini-batch Active Learning
Fedor Zhdanov
TL;DR
This work tackles the data-labeling bottleneck in supervised learning by proposing Diverse mini-Batch Active Learning (DBAL), a scalable method that jointly accounts for informativeness and diversity when selecting batches of unlabeled examples to annotate. It casts batch selection as a facility-location-like objective and solves it with weighted K-means, which incorporates per-example informativeness as the weights, enabling efficient mini-batch queries for deep learning models. Across text and image datasets (Browse Node UK Appliances, 20 Newsgroups, MNIST, CIFAR-10) and models from logistic regression to CNNs, DBAL shows consistent gains over pure uncertainty sampling and far greater scalability than submodular baselines. These results highlight the practical viability of diversity-aware batching for reducing labeling costs in real-world ML pipelines.
Abstract
We study the problem of reducing the amount of labeled training data required to train supervised classification models. We approach it with Active Learning, sequentially selecting the examples that benefit the model most. Selecting examples one by one is not practical for the number of training examples required by modern Deep Learning models. We therefore consider the mini-batch Active Learning setting, where several examples are selected at once. We present an approach that takes into account both the informativeness of the examples for the model and the diversity of the examples within a mini-batch. By using the well-studied K-means clustering algorithm, this approach scales better than previously proposed approaches and achieves comparable or better performance.
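The selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes margin-based uncertainty as the informativeness score, a prefilter that keeps the `beta * k` most informative candidates, and a plain Lloyd-style weighted K-means written with the standard library only. The names `margin_uncertainty`, `weighted_kmeans`, `select_batch`, and the parameter `beta` are hypothetical and chosen for illustration.

```python
import random


def margin_uncertainty(probs):
    """Informativeness score (assumed): 1 minus the gap between the two
    largest predicted class probabilities. Higher means less certain."""
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, n_iter=20, seed=0):
    """Lloyd's algorithm in which each point contributes to its cluster
    centre in proportion to its weight."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(n_iter):
        # Assignment step: index of the nearest centre for every point.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: weighted mean of each cluster's members.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            total = sum(weights[i] for i in members)
            if total > 0:  # keep the old centre for empty or zero-weight clusters
                centers[c] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim)
                ]
    return centers


def select_batch(features, probs, k, beta=3, seed=0):
    """Return indices of k unlabeled examples that are informative and diverse.

    features: list of feature vectors for the unlabeled pool
    probs:    list of predicted class-probability vectors, one per example
    beta:     prefilter factor (an assumption of this sketch)
    """
    scores = [margin_uncertainty(p) for p in probs]
    # Prefilter: keep only the beta * k most informative candidates.
    pool = sorted(range(len(features)), key=lambda i: scores[i],
                  reverse=True)[: beta * k]
    centers = weighted_kmeans([features[i] for i in pool],
                              [scores[i] for i in pool], k, seed=seed)
    # Pick, for each centre, the closest candidate not already chosen.
    chosen = []
    for c in centers:
        best = min((i for i in pool if i not in chosen),
                   key=lambda i: sq_dist(features[i], c))
        chosen.append(best)
    return chosen
```

Calling `select_batch(features, probs, k=10)` on the current unlabeled pool would return ten indices to send for labeling; the model is then retrained and the step repeated. Setting `beta=1` in this sketch keeps only the `k` most uncertain candidates, so it reduces to pure uncertainty sampling, while larger values give the clustering more room to spread the batch across the feature space.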
