Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis

Yuxin Xiao; Paul Pu Liang; Umang Bhatt; Willie Neiswanger; Ruslan Salakhutdinov; Louis-Philippe Morency

TL;DR

The paper addresses uncertainty quantification in PLM-based NLP pipelines by conducting a comprehensive, large-scale evaluation across four design choices: PLM selection, model size, uncertainty quantifier, and fine-tuning loss, tested on three tasks with in-domain and out-of-domain data. It systematically compares five base PLMs, finds ELECTRA to be consistently strong for calibration, and shows that larger PLMs tend to help under domain shift. Among uncertainty quantifiers, Temp Scaling best improves calibration with minimal cost, while ensembles offer limited gains in PLM settings. The results yield practical guidance and are complemented by released code and benchmarks to advance uncertainty quantification in NLP.
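
Temp Scaling divides a fine-tuned model's logits by a single scalar T, fitted on held-out validation data, before applying softmax. The following is a minimal sketch of that idea, not the authors' released code: the function names are hypothetical, and the grid search is a simple stand-in for the gradient-based optimizer normally used to fit T.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def softmax(logits, temperature=1.0):
    """Class probabilities after temperature scaling."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        total -= log_softmax(row, temperature)[y]
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimizes validation NLL.

    Assumption: a grid from 0.05 to 10.0 in steps of 0.05 is fine enough
    for illustration.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))
```

Because every logit is divided by the same positive T, the predicted class never changes, which is consistent with the observation that Temp Scaling preserves prediction results. Only the confidence attached to each prediction is adjusted, and fitting one scalar costs almost nothing compared with fine-tuning.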

Abstract

Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. That is, the pipeline should reliably indicate when we can trust its predictions. In particular, there are various considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. Although prior work has looked into some of these considerations, it usually draws conclusions from a limited scope of empirical studies. A holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline is still lacking. To fill this void, we compare a wide range of popular options for each consideration based on three prevalent NLP classification tasks and the setting of domain shift. Based on the results, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
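
Calibration error is commonly summarized by the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below illustrates the metric under stated assumptions: the function name is hypothetical, and the choice of 10 equal-width bins is an assumption rather than a setting confirmed by the text.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    n_bins:      number of bins (assumed value, not taken from the paper).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); a confidence of
        # exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For instance, four predictions that each carry confidence 0.9 but of which only three are correct land in one bin with accuracy 0.75, giving an ECE of about 0.15. A pipeline whose confidence matches its accuracy in every bin scores 0.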

Paper Structure (19 sections, 4 figures, 2 tables)

Introduction
Background
  Problem Formulation
  Related Work
Which Pre-trained Language Model?
  Experiment Setup
  Empirical Findings
What Model Size?
  Experiment Setup
  Empirical Findings
Which Uncertainty Quantifier?
  Experiment Setup
  Empirical Findings
Which Fine-tuning Loss?
  Experiment Setup
...and 4 more sections

Figures (4)

Figure 1: Calibration and (selective) prediction performance of five PLMs in three NLP tasks under two domain settings. The calibration quality of the five PLMs is relatively consistent across tasks and domains, while XLNet is the least robust to domain shift. ELECTRA stands out due to its lowest scores in ECE, prediction error, and RPP.
Figure 2: Calibration and prediction performance of large and base PLMs in three NLP tasks under two domain settings. Larger PLMs calibrate better than their respective base versions when evaluated out-of-domain, while calibrating slightly worse in-domain with one exception in Commonsense Reasoning. If the computational budget permits, larger PLMs constitute more powerful pipelines given their lower out-of-domain ECE along with lower prediction error. We also observe a positive correlation between calibration error and prediction error out-of-domain.
Figure 3: Change in calibration and prediction performance due to the use of four uncertainty quantifiers. The effectiveness of these quantifiers in reducing ECE follows the descending order of Temp Scaling, MC Dropout, Ensemble, and LL SVI. The drop in ECE is more significant out-of-domain. Temp Scaling is the most compelling uncertainty quantifier due to its largest reduction in ECE, preservation of prediction results, and little computational cost.
Figure 4: Calibration and out-of-domain detection performance of BERT base models fine-tuned by five losses. Focal Loss, Label Smoothing, and MMCE are more capable of fine-tuning well-calibrated models compared to Cross Entropy and Brier Loss. Focal Loss is the best option due to its competitively low ECE and FAR95.
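
For reference, Focal Loss scales the cross-entropy term of each example by (1 - p_y)^gamma, where p_y is the probability assigned to the true class, so that confidently correct examples contribute less to the loss. The following is a minimal per-example sketch; the function name, the default gamma = 2.0, and the clamping constant are assumptions for illustration and are not taken from the paper.

```python
import math


def focal_loss(probs, label, gamma=2.0, eps=1e-12):
    """Focal loss for a single example.

    probs: predicted class probabilities (should sum to 1).
    label: index of the true class.
    gamma: focusing parameter (assumed default); gamma = 0 gives cross-entropy.
    eps:   lower clamp on the true-class probability to avoid log(0).
    """
    p = max(probs[label], eps)
    return -((1.0 - p) ** gamma) * math.log(p)
```

With gamma = 0 the expression reduces to ordinary cross-entropy. With gamma = 2, an example predicted at 0.9 for the true class is weighted by 0.01, so the model is pushed less strongly toward ever more extreme confidence on examples it already gets right.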
