On the Fragility of Active Learners for Text Classification

Abhishek Ghose, Emma Thuong Nguyen

TL;DR

A rigorous evaluation of active learning (AL) techniques, old and new, is released, covering setups that vary in dataset, text representation and classifier. It yields insights into warm-up times (the number of labels needed before gains from AL are seen), the viability of an “Always ON” mode, and the relative significance of the different factors.

Abstract

Active learning (AL) techniques aim to make the best use of a labeling budget by iteratively selecting the instances that are most valuable for learning. However, they lack “prerequisite checks”, i.e., there are no prescribed criteria for picking the AL algorithm best suited to a dataset. A practitioner must pick a technique they trust to beat random sampling, based on previously reported results, and hope that it is resilient to the many variables in their environment: dataset, labeling budget and prediction pipeline. The important questions then are: how often, on average, can we expect any AL technique to reliably beat the computationally cheap and easy-to-implement strategy of random sampling? Does it at least make sense to use AL in an “Always ON” mode in a prediction pipeline, so that while it might not always help, it never under-performs random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution is a rigorous evaluation of AL techniques, old and new, across setups that vary in dataset, text representation and classifier. This yields multiple insights into warm-up times (the number of labels needed before gains from AL are seen), the viability of an “Always ON” mode, and the relative significance of the different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.
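
As an illustration of the "iteratively selecting instances" step described in the abstract, the sketch below contrasts one generic query strategy, margin-based uncertainty sampling, with the random-sampling baseline. It is a minimal example under stated assumptions, not the paper's framework: the function names `margin_select` and `random_select` and the toy probabilities are hypothetical, and margin sampling is used only because it is a common textbook strategy.

```python
import random


def margin_select(probs, batch_size):
    """Pick the batch_size pool items whose top-2 class probabilities
    are closest together (smallest margin = most uncertain).

    probs: list of per-class probability lists, one per unlabeled item.
    Returns pool indices, most uncertain first.
    """
    def margin(p):
        top = sorted(p, reverse=True)
        return top[0] - top[1]

    ranked = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return ranked[:batch_size]


def random_select(pool_size, batch_size, seed=0):
    """Baseline: pick batch_size pool indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), batch_size))


# Hypothetical classifier outputs for a pool of four unlabeled texts.
pool_probs = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # small margin
    [0.50, 0.49, 0.01],  # smallest margin
    [0.70, 0.20, 0.10],  # moderate margin
]

print(margin_select(pool_probs, batch_size=2))      # [2, 1]
print(random_select(len(pool_probs), batch_size=2))
```

In a full AL loop, the selected items would be labeled, added to the training set, and the classifier retrained before the next selection round; the question the abstract raises is whether the first function reliably beats the second once the whole pipeline is taken into account.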

Paper Structure (26 sections, 1 equation, 6 figures, 5 tables, 1 algorithm)


Figures (6)

  • Figure 1: The space of experiments. See § \ref{sec:method} for a description. All representations are produced by pre-trained models, which are ubiquitous in practice today. The lines between the boxes "Representation" and "Classifier" denote the combinations that constitute our prediction pipelines. Note that RoBERTa is an end-to-end predictor, with no separate representation and classification steps. Also note that the popular Transformer architecture (Vaswani et al., 2017) is represented by RoBERTa and MPNet here.
  • Figure 2: F1-macro scores on the test set at each iteration, for the dataset agnews and a batch size of $200$. The $x$-axes show the size of the labeled data; the $y$-axes show the F1-macro scores on the test data.
  • Figure 3: Expected relative improvement in F1-macro score over random sampling. (a)-(e) show this for different predictors and query strategies (QS), at different training sizes (see titles). These correspond to Equation \ref{eqn:avg_gain}. (f) and (g) show marginalized improvements for different predictors and QS respectively; see Equations \ref{eqn:avg_gain_pipelines} and \ref{eqn:avg_gain_QS}.
  • Figure 4: Effect of text representations on the relative improvement.
  • Figure 5: Expectation over variance of F1-macro given a pipeline and dataset, plotted against the size of the labeled data. Note that the batch/seed sizes don't strongly influence the trends.
  • ...and 1 more figure
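
The quantity plotted in Figure 3, relative improvement in F1-macro over random sampling, can be sketched as follows. The equations the caption refers to are not reproduced in this excerpt, so the formula below (percentage change relative to the random-sampling score, averaged over runs) is an assumption, as are the helper names `relative_improvement`, `expected_relative_improvement` and `warm_up_size` and all numbers used. The warm-up helper follows the TL;DR's informal definition: the number of labels before gains from AL are seen.

```python
from statistics import mean


def relative_improvement(f1_al, f1_random):
    """Percentage change of the AL score relative to the random baseline
    (assumed form; the paper's exact equation is not shown here)."""
    return 100.0 * (f1_al - f1_random) / f1_random


def expected_relative_improvement(runs):
    """Average the relative improvement over (f1_al, f1_random) pairs,
    e.g. one pair per dataset or trial at a fixed labeled-set size."""
    return mean(relative_improvement(a, r) for a, r in runs)


def warm_up_size(curve):
    """Return the first labeled-set size with a positive expected gain.

    curve: list of (labeled_size, expected_improvement), sorted by size.
    Returns None if AL never beats random on this curve.
    """
    for size, gain in curve:
        if gain > 0:
            return size
    return None


# Hypothetical F1-macro scores (AL, random) at one labeled-set size.
runs = [(0.66, 0.60), (0.57, 0.60), (0.63, 0.60)]
print(round(expected_relative_improvement(runs), 2))

# Hypothetical improvement curve over increasing labeled-set sizes.
curve = [(200, -1.0), (400, -0.2), (600, 0.5), (800, 1.1)]
print(warm_up_size(curve))
```

A negative value means the AL setup under-performed random sampling at that labeled-set size, which is the case an "Always ON" mode would need to rule out.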