Towards Efficient Active Learning in NLP via Pretrained Representations

Artem Vysogorets; Achintya Gopal

TL;DR

This work tackles the high computational cost of active learning (AL) with large pretrained language models for text classification. It introduces PRepAL, which precomputes fixed representations from a backbone LLM and trains only a lightweight classifier during each AL iteration, reserving full fine-tuning for the final labeled set. Across multiple datasets and backbones (e.g., BERT and RoBERTa), PRepAL performs close to AL+FT, the baseline that fine-tunes the full model at every AL iteration, while running orders of magnitude faster. The acquired labeled data also transfers effectively to other models. The approach supports sequential labeling, is compatible with common acquisition functions, and leaves the choice of final model open, making efficient AL practical in real-world settings with scarce labels and evolving model ecosystems.

Abstract

Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. When labeled documents are scarce, active learning reduces annotation effort but requires retraining massive models at every acquisition iteration. We drastically expedite this process by using pretrained representations of LLMs within the active learning loop and, once the desired amount of labeled data is acquired, fine-tuning the same or even a different pretrained LLM on this labeled data to achieve the best performance. As verified on common text classification benchmarks with pretrained BERT and RoBERTa as the backbone, our strategy yields performance similar to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive. The data acquired with our procedure generalizes across pretrained networks, allowing flexibility in choosing the final model or updating it as newer versions are released.
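The acquisition loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary labels, uses synthetic vectors where a real run would use representations precomputed once from a frozen backbone, uses a plain logistic-regression head as the lightweight classifier, and uses prediction entropy as the acquisition score. The names `train_head`, `acquire`, and `oracle` are hypothetical, and the final fine-tuning step on the acquired set is not shown.

```python
import math
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic-regression head on fixed feature vectors."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def entropy(p):
    """Entropy of a Bernoulli prediction; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def acquire(features, oracle, seed_size=10, batch_size=5, iterations=4, seed=0):
    """Label a random seed set, then repeatedly retrain only the head on the
    fixed features and label the highest-entropy unlabeled samples."""
    rng = random.Random(seed)
    labels = {i: oracle(i) for i in rng.sample(range(len(features)), seed_size)}
    for _ in range(iterations):
        idx = list(labels)
        w, b = train_head([features[i] for i in idx], [labels[i] for i in idx])
        unlabeled = [i for i in range(len(features)) if i not in labels]
        ranked = sorted(
            unlabeled,
            key=lambda i: entropy(sigmoid(dot(w, features[i]) + b)),
            reverse=True,
        )
        for i in ranked[:batch_size]:
            labels[i] = oracle(i)  # query the annotator
    return labels  # index -> label; the set a final model would be fine-tuned on
```

Because the features never change, each iteration costs only one small classifier fit and one pass of scoring over the pool, which is the source of the speedup the abstract describes.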

Paper Structure (18 sections, 8 figures, 4 tables, 1 algorithm)

Introduction
Our method.
Related Work
Data-based.
Model-based.
Prediction-based.
Active learning with proxy models.
Method
Experiments
Active learning protocols.
Main results.
Selected data.
Transferability across models.
Reducing the batch size.
Discussion
...and 3 more sections

Figures (8)

Figure 1: Active learning with MaxEntropy acquisition function and BERT backbone on QNLI across different strategies over $39$ labeling iterations. Left: validation performance after training on labeled data thus far. Error bands represent $\pm 1$ standard deviation. Right: wall-clock time (in seconds) spent on each phase and validation accuracy of the final model trained on $2{,}000$ acquired samples. All models trained to convergence on five cores and a Tesla V100-SXM2-32GB GPU.
Figure 2: Validation accuracy of final models across different acquisition functions, retraining methods, and datasets. All use BERT as the backbone LLM. Error bands represent $\pm 1$ standard deviation.
Figure 3: SST-2 dataset. The red-toned curves and the grey curve show the validation accuracy of different models with different active learning protocols. The blue-toned curves indicate Jaccard similarity between subsets of data indices selected by different active learning protocols and the data indices selected by AL+FT. Error bands represent $\pm 1$ standard deviation. Among all acquisition functions, only DAL presents a visible performance gap between PRepAL and AL+FT.
Figure 4: IMDb dataset. The red-toned curves and the grey curve show the validation accuracy of different models with different active learning protocols. The blue-toned curves indicate Jaccard similarity between subsets of data indices selected by different active learning protocols and the data indices selected by AL+FT. Error bands represent $\pm 1$ standard deviation. PRepAL retains the above-random performance of AL+FT with MaxEntropy and VariationRatio.
Figure 5: Test accuracy of various active learning protocols with MaxEntropy acquisition function and a batch size (bs) of $50$ samples per iteration (red-toned curves), or $1$ sample per iteration (green-toned curves). Top: RoBERTa on SST-2; Bottom: RoBERTa on IMDb.
...and 3 more figures
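
Figures 3 and 4 compare the data indices selected by different protocols using Jaccard similarity, i.e. the size of the intersection of two index sets divided by the size of their union. A minimal sketch follows. The function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions made here, not details taken from the paper.

```python
def jaccard(selected_a, selected_b):
    """Jaccard similarity between two collections of selected data indices."""
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty selections are identical
    return len(a & b) / len(a | b)
```

A value of 1 means the two protocols selected exactly the same samples, and a value of 0 means their selections are disjoint.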
