To Softmax, or not to Softmax: that is the question when applying Active Learning for Transformer Models

Julius Gonsior; Christian Falkenberg; Silvio Magino; Anja Reusch; Maik Thiele; Wolfgang Lehner

TL;DR

This paper investigates active learning for fine-tuning Transformer models and questions the reliability of softmax-based confidence. It rigorously compares eight alternative confidence-quantification methods across seven datasets using BERT-base and RoBERTa-base, and introduces UC, a simple top-k uncertainty clipping heuristic, to mitigate labeling of outliers. The findings show that while some alternatives can match or approach softmax performance, strategies that focus on the most uncertain samples often label outliers, hurting accuracy; UC generally improves performance across methods and models, with no single method dominating. The work offers practical guidance for AL in NLP and highlights the importance of calibration and outlier management in uncertainty-based sampling, all while sharing the experimental framework for reproducibility.

Abstract

Despite achieving state-of-the-art results in nearly all Natural Language Processing applications, fine-tuning Transformer-based language models still requires a significant amount of labeled data to work. A well-known technique to reduce the amount of human effort in acquiring a labeled dataset is Active Learning (AL): an iterative process in which only a minimal number of samples is labeled. AL strategies require access to a quantified confidence measure of the model predictions. A common choice is the softmax activation function for the final layer. As the softmax function provides misleading probabilities, this paper compares eight alternatives on seven datasets. Our almost paradoxical finding is that most of the methods are too good at identifying the true most uncertain samples (outliers), and that labeling exclusively outliers therefore results in worse performance. As a heuristic, we propose to systematically ignore the top-$k$ most uncertain samples, which results in improvements of various methods compared to the softmax function.
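The clipping heuristic described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `softmax` and `query_with_clipping` are hypothetical, uncertainty is taken as one minus the highest class probability, and `k` is assumed to be an absolute count of samples to skip (the text here does not say whether it is a count or a fraction).

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def query_with_clipping(probs, batch_size, k):
    """Pick `batch_size` samples for labeling after skipping the `k` most uncertain.

    probs: array of shape (n_samples, n_classes) with class probabilities.
    k: assumed here to be a count; k=0 gives plain uncertainty sampling.
    """
    # Assumption: uncertainty = 1 - confidence of the predicted class.
    uncertainty = 1.0 - probs.max(axis=1)
    # Rank from most to least uncertain; stable sort keeps ties in input order.
    ranked = np.argsort(-uncertainty, kind="stable")
    # Ignore the top-k entries (presumed outliers) and take the next batch.
    return ranked[k:k + batch_size]


# Toy pool of four unlabeled samples with two classes each.
pool_probs = np.array([
    [0.50, 0.50],
    [0.90, 0.10],
    [0.60, 0.40],
    [0.99, 0.01],
])
plain_query = query_with_clipping(pool_probs, batch_size=2, k=0)
clipped_query = query_with_clipping(pool_probs, batch_size=2, k=1)
```

With `k=0` the most uncertain sample (index 0) is queried first; with `k=1` it is skipped and the next two samples in the ranking are selected instead.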

Paper Structure (27 sections, 6 equations, 6 figures, 2 tables)

Introduction
AL 101
Confidence Probability Quantification Methods
IS (Single Network Deterministic Model)
TrSc (Single Network Deterministic Model)
Evi (Single Network Deterministic Model)
MC (Bayesian Method)
Softmax Ensemble (Ensemble Method)
TeSc (Softmax Calibration Method)
LS (Softmax Calibration Method)
UC
Experimental Setup
Setup
Active Learning Simulation
Transformer Models
...and 12 more sections

Figures (6)

Figure 1: Standard Active Learning Cycle including our proposed UC to influence the uncertainty based ranking (using the probability $P_\theta(y|x)$ of the learner model $\theta$ in predicting class $y$ for a sample $x$) by ignoring the top-$k$ results
Figure 2: Exemplary uncertainty values (one minus the classification confidence probability) for a single iteration on the TREC-6 dataset before UC, shown as histograms
Figure 3: Distribution of $acc_{last5}$ including the UC variants, with the average $acc_{last5}$ values per dataset shown as colored lines, ordered by $acc_{last5}$ after UC. The arithmetic mean of the runs per method is included as a white diamond in the middle of the plots. The vanilla softmax-based baselines Ent, MM, and LC are marked in blue, and the baselines Random Selection and the Passive classifier are marked in orange.
Figure 4: Heatmap for the Jaccard coefficients of the queried samples between each pair of strategies. High coefficients indicate highly similar strategies. On the right side the displayed numbers indicate the difference to the original coefficients.
Figure 5: Difference of class distribution of the queried samples compared to the train set for the two datasets TREC-6 and AG's News for the original BERT model
...and 1 more figures
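
The Jaccard coefficient named in the Figure 4 caption measures how much the sample sets queried by two strategies overlap. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the helper names `jaccard` and `pairwise_jaccard` and the strategy names in the toy data are hypothetical, and two empty sets are given a coefficient of 1.0 by convention.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of sample ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(queried):
    """Map each unordered pair of strategy names to its Jaccard coefficient."""
    return {
        (s1, s2): jaccard(queried[s1], queried[s2])
        for s1, s2 in combinations(sorted(queried), 2)
    }


# Toy example: ids of the samples each (hypothetical) strategy queried.
queried_ids = {
    "strategy_a": [1, 2, 3],
    "strategy_b": [2, 3, 4],
    "strategy_c": [7, 8, 9],
}
coefficients = pairwise_jaccard(queried_ids)
```

A coefficient near 1 means two strategies queried almost the same samples, and a coefficient of 0 means their queries were disjoint.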
