On the Fragility of Active Learners for Text Classification

Abhishek Ghose; Emma Thuong Nguyen

TL;DR

We release a rigorous evaluation of AL techniques, old and new, across setups that vary with respect to datasets, text representations and classifiers. It yields multiple insights into warm-up times (the number of labels needed before gains from AL are seen), the viability of an “Always ON” mode, and the relative significance of different factors.

Abstract

Active learning (AL) techniques optimally utilize a labeling budget by iteratively selecting instances that are most valuable for learning. However, they lack “prerequisite checks”, i.e., there are no prescribed criteria for picking the AL algorithm best suited to a dataset. A practitioner must pick a technique they trust will beat random sampling, based on previously reported results, and hope that it is resilient to the many variables in their environment: dataset, labeling budget and prediction pipeline. The important questions then are: how often, on average, can we expect an AL technique to reliably beat the computationally cheap and easy-to-implement strategy of random sampling? Does it at least make sense to use AL in an “Always ON” mode in a prediction pipeline, so that while it might not always help, it never under-performs random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution is a rigorous evaluation of AL techniques, old and new, across setups that vary with respect to datasets, text representations and classifiers. This unlocks multiple insights into warm-up times (the number of labels needed before gains from AL are seen), the viability of an “Always ON” mode, and the relative significance of different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.
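
To make the comparison in the abstract concrete, the following is a minimal sketch of one batch-selection step and of a relative-improvement score against random sampling. It is not the paper's implementation: the margin-based query strategy is only an illustrative choice, and the names `margin`, `select_batch` and `relative_improvement`, as well as the percentage formula, are assumptions introduced here.

```python
import random


def margin(probs):
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]


def select_batch(pool_probs, batch_size, strategy="margin", rng=None):
    """Return indices of unlabeled pool items to send for labeling.

    pool_probs: one list of class probabilities per unlabeled item
    strategy:   "margin" (illustrative query strategy) or "random" (baseline)
    """
    indices = list(range(len(pool_probs)))
    if strategy == "random":
        rng = rng or random.Random(0)
        return sorted(rng.sample(indices, batch_size))
    # Most uncertain items first; sorted() is stable, so ties keep index order.
    ranked = sorted(indices, key=lambda i: margin(pool_probs[i]))
    return sorted(ranked[:batch_size])


def relative_improvement(f1_al, f1_random):
    """Assumed definition: percent change in F1-macro of AL over random sampling."""
    return 100.0 * (f1_al - f1_random) / f1_random


# Hypothetical classifier outputs for four unlabeled texts (two classes).
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.5, 0.5], [0.8, 0.2]]
picked = select_batch(pool_probs, batch_size=2)
baseline = select_batch(pool_probs, batch_size=2, strategy="random")
```

In a full loop, the picked items would be labeled, added to the training set, and the classifier retrained before the next selection; a negative `relative_improvement` at a given training size would mean the query strategy under-performed random sampling there.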

Paper Structure (26 sections, 1 equation, 6 figures, 5 tables, 1 algorithm)

Introduction
Previous Work
Batch Active Learning - Overview
Experiment Setup
Configuration Space of Experiments
Metrics and Other Settings
Notation and Terminology
Decision Model
Results
Expected Gains from AL
Always ON Mode
Effect of Prediction Pipeline vs QS
Effect of Batch/Seed Size
Effect of Representation
Summary and Conclusion
...and 11 more sections

Figures (6)

Figure 1: The space of experiments. See § \ref{sec:method} for a description. All representations are produced by pre-trained models, which are ubiquitous in practice today. The lines between the boxes "Representation" and "Classifier" denote combinations that constitute our prediction pipelines. Note that RoBERTa is an end-to-end predictor, where there are no separate representation and classification steps. Also note that the popular Transformer architecture \cite{NIPS2017_3f5ee243} is represented by RoBERTa and MPNet here.
Figure 2: F1-macro scores on the test set at each iteration, for the dataset agnews and a batch size of $200$. The $x$-axes show the size of the labeled data; the $y$-axes show the F1-macro scores on the test data.
Figure 3: Expected relative improvement in F1-macro score over random. (a)-(e) show this for different predictors and QS, at different training sizes (see titles). These correspond to Equation \ref{eqn:avg_gain}. (f) and (g) show marginalized improvements for different predictors and QSes respectively; see Equations \ref{eqn:avg_gain_pipelines} and \ref{eqn:avg_gain_QS}.
Figure 4: Effect of text representations on the relative improvement.
Figure 5: Expectation over variance of F1-macro given a pipeline and dataset, plotted against size of labeled data. Note that the batch/seed sizes don't strongly influence trends.
...and 1 more figures
