To Softmax, or not to Softmax: that is the question when applying Active Learning for Transformer Models
Julius Gonsior, Christian Falkenberg, Silvio Magino, Anja Reusch, Maik Thiele, Wolfgang Lehner
TL;DR
This paper investigates active learning for fine-tuning Transformer models and questions the reliability of softmax-based confidence. It rigorously compares eight alternative confidence-quantification methods across seven datasets using BERT-base and RoBERTa-base, and introduces UC, a simple top-k uncertainty clipping heuristic, to mitigate labeling of outliers. The findings show that while some alternatives can match or approach softmax performance, strategies that focus on the most uncertain samples often label outliers, hurting accuracy; UC generally improves performance across methods and models, with no single method dominating. The work offers practical guidance for AL in NLP and highlights the importance of calibration and outlier management in uncertainty-based sampling, all while sharing the experimental framework for reproducibility.
Abstract
Although Transformer-based language models achieve state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning them still requires a significant amount of labeled data. A well-known technique for reducing the human effort needed to acquire a labeled dataset is Active Learning (AL): an iterative process in which only a minimal number of samples is labeled. AL strategies require access to a quantified confidence measure of the model's predictions. A common choice is the softmax activation function of the final layer. As the softmax function provides misleading probabilities, this paper compares eight alternatives on seven datasets. Our almost paradoxical finding is that most of the methods are too good at identifying the truly most uncertain samples (outliers), and that labeling exclusively outliers therefore results in worse performance. As a heuristic, we propose to systematically ignore the most uncertain samples, which improves various methods compared to the softmax function.
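To make the selection step concrete, the following is a minimal sketch of softmax-based uncertainty sampling with an optional clipping of the most uncertain samples. It is an illustration under stated assumptions, not the paper's implementation: the function names, the least-confidence score, and the `clip_fraction` parameter are hypothetical, and the exact clipping rule used by the authors is not specified in this abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw model outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def least_confidence(probs):
    """Uncertainty score: one minus the probability of the predicted class."""
    return 1.0 - max(probs)


def select_batch(pool_logits, batch_size, clip_fraction=0.0):
    """Return indices of the samples to label next.

    pool_logits   : one list of logits per unlabeled sample
    batch_size    : number of samples to send to the annotator
    clip_fraction : hypothetical parameter; share of the most uncertain
                    samples to skip, as presumed outliers
    """
    scores = [least_confidence(softmax(l)) for l in pool_logits]
    # Rank the pool from most to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Skip the top of the ranking, then take the next batch_size samples.
    n_clip = int(len(ranked) * clip_fraction)
    return ranked[n_clip:n_clip + batch_size]


# Toy pool of four binary-classification samples.
pool = [[0.0, 0.0], [2.0, 0.0], [5.0, 0.0], [0.1, 0.0]]
plain = select_batch(pool, batch_size=2)
clipped = select_batch(pool, batch_size=2, clip_fraction=0.25)
```

With `clip_fraction=0.0` the sketch reduces to plain uncertainty sampling; a positive value skips the head of the ranking, which is where the outliers described above would be expected to sit.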
