Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

Christopher Schröder, Gerhard Heyer

TL;DR

This work investigates how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can be used to improve the efficiency of active learning for text classification and introduces HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks.

Abstract

Active learning is an iterative labeling process that is used to obtain a small labeled subset when no labeled data is available, thereby making it possible to train a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, even though it is available in considerably larger quantities than the usually small set of labeled data. In this work, we investigate how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can be used to improve the efficiency of active learning for text classification. Building on a comprehensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks. Our results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using as little as 25% of the data. The code is publicly available at https://github.com/chschroeder/self-training-for-sample-efficient-active-learning .

Paper Structure (52 sections, 4 equations, 3 figures, 9 tables, 2 algorithms)

Figures (3)

  • Figure 1: Active learning (a), and active learning with interleaved self-training (b). For active learning, the most uncertain samples are labeled by the human annotator, while for self-training pseudo-labels are obtained from the current model using the most certain samples.
  • Figure 2: Learning curves per model, query strategy, and dataset, showing the classification performance on the test set. The x-axis shows the number of instances, while the y-axis indicates classification performance. The horizontal (red) line represents the performance of the respective model trained on 100% of the data (without active learning).
  • Figure 3: The effect of label noise for NeST and HAST on AGN. Each label is replaced by an incorrect random label with probability $\lambda$. The left side shows validation accuracy after the final active learning iteration. The right side shows the respective area under the learning curve for all 10 queries.
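The selection step described for Figure 1 and the noise model described for Figure 3 can be illustrated with a minimal sketch. This is not the HAST strategy or the authors' implementation: the function names `split_by_confidence` and `inject_label_noise` are hypothetical, and using the maximum class probability as the confidence score (least-confidence sampling) is an assumption made only for illustration.

```python
import random


def split_by_confidence(probs, n_query, n_pseudo):
    """Split unlabeled samples into a human-query set and a pseudo-label set.

    probs: list of per-class probability lists, one per unlabeled sample.
    Assumption: confidence is the maximum class probability.
    """
    confidence = [max(p) for p in probs]
    # Sort sample indices from least to most confident.
    order = sorted(range(len(probs)), key=lambda i: confidence[i])
    # The most uncertain samples go to the human annotator.
    query_idx = order[:n_query]
    # Among the rest, the most certain samples receive pseudo-labels.
    remaining = order[n_query:]
    certain_idx = remaining[-n_pseudo:] if n_pseudo > 0 else []
    pseudo = [(i, probs[i].index(max(probs[i]))) for i in certain_idx]
    return query_idx, pseudo


def inject_label_noise(labels, num_classes, lam, rng):
    """Replace each label by a random incorrect label with probability lam."""
    noisy = []
    for y in labels:
        if rng.random() < lam:
            wrong = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(y)
    return noisy


# Hypothetical model outputs for four unlabeled samples and two classes.
probs = [[0.55, 0.45], [0.98, 0.02], [0.10, 0.90], [0.51, 0.49]]
query_idx, pseudo = split_by_confidence(probs, n_query=1, n_pseudo=2)
print(query_idx)  # index of the sample sent to the annotator
print(pseudo)     # (index, pseudo-label) pairs for the most certain samples

rng = random.Random(0)
print(inject_label_noise([0, 1, 2, 3], num_classes=4, lam=0.5, rng=rng))
```

In an active learning loop with interleaved self-training, a step like `split_by_confidence` would run once per iteration, after which the model is retrained on the human-labeled and pseudo-labeled samples together.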