On Uncertainty In Natural Language Processing

Dennis Ulmer

TL;DR

This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective, and how it can be reduced and quantified through the design of the experimental pipeline, and develops an approach to quantify confidence in large black-box language models using auxiliary predictors.

Abstract

The last decade in deep learning has produced increasingly capable systems that are deployed in a wide variety of applications. In natural language processing, the field has been transformed by a number of breakthroughs, including large language models, which are used in a growing number of user-facing applications. To reap the benefits of this technology and reduce potential harms, it is important to quantify the reliability of model predictions and the uncertainties that shroud their development. This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective, and how it can be reduced and quantified through the design of the experimental pipeline. We further explore uncertainty quantification in modeling by theoretically and empirically investigating the effect of inductive model biases in text classification tasks. The corresponding experiments cover three languages (Danish, English and Finnish) and tasks, as well as a large set of different uncertainty quantification approaches. Additionally, we propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction, which provides tighter token sets with better coverage of the actual continuation. Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors, where the confidence is predicted solely from the input to the target model and its generated output text.

Paper Structure

This paper contains 274 sections, 12 theorems, 144 equations, 57 figures, 34 tables, 3 algorithms.

Neural Network.
Neural Network Functions.
Indicator function.
Statistics.
Special Functions.
Introduction
Motivation
Applications
Safety.
Trust.
Fairness.
Efficiency.
Interpretability.
Challenges in Natural Language Processing
Challenges of Natural Language.
...and 259 more sections

Key Result

Lemma 1

Suppose $f_{\bm{\theta}}$ is a ReLU network according to Definition def:logit. Then $f_{\bm{\theta}}$ is a component-wise strictly monotonic function on each of its polytopes $Q\in\mathcal{Q}$, as long as its corresponding matrix ...

Figures (57)

Figure 1.1: (a) Number of models published per month on https://huggingface.co/ (gathered on 17.06.2024). (b) Trajectory of the first two sentences of turing1950computing in the latent space of the uppermost layer of BERT (devlin2019bert), after projecting them into two-dimensional space using PCA and whitening them. Time is indicated by color, ranging from dark (first token) to light (last token). (c) Number of Wikipedia articles per language, log-scale (gathered on 11.04.2024). Shown are the top ten languages, followed by ten randomly chosen languages from each of the remaining four quantiles of the distribution. All figures are best viewed in color and digitally.
Figure 1.1: Illustration from gao2017properties, showing the interplay of softmax probabilities between components for $K=2$ in $\mathbb{R}^2$.
Figure 2.1: Plots of different choices of (a) Beta prior distributions and their resulting (b) posterior distributions after observing our initial set of coin flips. Juxtaposing the two plots illustrates how the choice of prior belief can influence the shape of the resulting posterior distribution.
Figure 2.1: Comparing test score distributions for different tests and distributions as a function of sample size.
Figure 2.2: Highest density intervals (gray regions) and maximum a posteriori estimates (red vertical lines) for different Beta posteriors.
...and 52 more figures

Theorems & Definitions (31)

Definition 1: Interpersonal Trust; mayer1995integrative
Definition 2: Human-AI Trust; jacovi2021formalizing
Definition 3: Model
Definition 4: Metric & Observation
Example 1: Part-of-Speech tagging
Definition 5: ReLU Classifier
Definition 6: Monotonicity in Multivariate Functions
Lemma 1
proof
Lemma 2
...and 21 more
