Finetuning Language Models to Emit Linguistic Expressions of Uncertainty

Arslan Chaudhry, Sridhar Thiagarajan, Dilan Gorur

TL;DR

This work measures the calibration of pre-trained models and then fine-tunes language models to generate calibrated linguistic expressions of uncertainty. It demonstrates that LLMs are well-calibrated in assessing their predictions and that supervised finetuning based on the model's own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.

Abstract

Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks. Despite their broad utility, LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing. As a result, end-users struggle to consistently align the confidence expressed by LLMs with the accuracy of their predictions, often leading to either blind trust in all outputs or a complete disregard for their reliability. In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty. Specifically, we measure the calibration of pre-trained models and then fine-tune language models to generate calibrated linguistic expressions of uncertainty. Through experiments on various question-answering datasets, we demonstrate that LLMs are well-calibrated in assessing their predictions, and supervised finetuning based on the model's own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.

Paper Structure (15 sections, 2 equations, 21 figures, 2 tables, 2 algorithms)

Figures (21)

  • Figure 1: Motivation: The agent provides an incorrect response to a given query. In the response at the bottom, however, the agent includes an uncertainty expression. Without this uncertainty expression, as seen in the response at the top, the human user might form an incorrect belief about the world. In contrast, with the uncertainty-augmented response at the bottom, the human user is prompted to consult additional resources, leading to a more accurate understanding of the world.
  • Figure 2: Finetuning dataset curation process: Here LLM refers to the language model that we are interested in finetuning. LLM$^*$ refers to an operation that mixes the uncertainty expression with the model prediction: it can be a prompted language model (interleaved case) or simply an operation that prefixes or post-fixes the answer with the expression of uncertainty. Given a question on the left, the LLM produces a raw prediction and then computes its own confidence in that prediction. The confidence score is converted to a linguistic expression, which is then combined with the raw prediction. Prompt \ref{prompt:self_evaluation} and Prompt \ref{prompt:interleave_uncertainty_answer} are given in the appendix.
  • Figure 3: Evaluation process: The finetuned LLM produces an answer with an expression of uncertainty (left), which a prompted LLM$^*$ splits into the raw answer ('Joseph Warwick') and the expression of uncertainty ('Highly unlikely') using Prompt \ref{prompt:deaugmentation}. LLM$^*$ then judges the correctness of the raw answer using the LME Prompt \ref{prompt:lme}, and the uncertainty expression is converted to a float equal to the average of the probability range the expression belongs to. Based on the correctness and uncertainty score, the final metric is computed.
  • Figure 4: TriviaQA Calibration Chart: The top row shows raw calibration scores at temperature=1.0 without any post-processing. The bottom row shows calibration scores post-processed with isotonic regression. In each plot, the x-axis is the $p_{model}(true)$ of the generated prediction (shown here as Confidence Bin) and the y-axis is the probability that the prediction is actually correct (shown here as Accuracy). Expected Calibration Error (ECE) and Brier Score are reported at the top of each plot. The error bars show the variance of accuracy in each bin.
  • Figure 5: Calibration Charts of Finetuned Models: The top row is TriviaQA. The bottom row is AmbigQA. The model generates post-fixed uncertainty expressions. The x-axis is the $p_{model}(true)$ obtained by converting the linguistic expression of uncertainty to a float using Table \ref{tab:expression_map} (shown here as Confidence Bin) and the y-axis is the probability that the prediction is actually correct (shown here as Accuracy). No post-processing is done on the $p_{model}(true)$. Expected Calibration Error (ECE) and Brier Score are reported at the top of each plot. The error bars show the variance of accuracy in each bin.
  • ...and 16 more figures
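
The captions for Figures 2 through 5 describe a small pipeline: a confidence score is mapped to a linguistic expression, the expression is later mapped back to a float (the average of its probability range), and calibration is summarized with ECE and the Brier score. The following is a minimal Python sketch of those steps, not the authors' implementation. The bin boundaries and all phrases other than 'Highly unlikely' are hypothetical placeholders, since the paper's expression map table is not reproduced here; the function names and the equal-width, 10-bin ECE are likewise assumptions made for illustration.

```python
# Hypothetical probability bins and phrases. The paper's actual mapping
# (its expression-map table) is not shown in this section, so these
# boundaries and wordings are assumptions for illustration only.
BINS = [
    (0.0, 0.2, "Highly unlikely"),
    (0.2, 0.4, "Unlikely"),
    (0.4, 0.6, "Somewhat likely"),
    (0.6, 0.8, "Likely"),
    (0.8, 1.0, "Highly likely"),
]


def confidence_to_expression(p: float) -> str:
    """Map a confidence in [0, 1] to the phrase of the bin containing it."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {p}")
    for _, upper, phrase in BINS:
        if p < upper:
            return phrase
    return BINS[-1][2]  # p == 1.0 belongs to the top bin


def expression_to_float(phrase: str) -> float:
    """Map a phrase back to the average (midpoint) of its probability range."""
    key = phrase.strip().lower()
    for lower, upper, name in BINS:
        if name.lower() == key:
            return (lower + upper) / 2
    raise KeyError(f"unknown uncertainty expression: {phrase!r}")


def brier_score(confs, correct) -> float:
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((c - float(y)) ** 2 for c, y in zip(confs, correct)) / len(confs)


def expected_calibration_error(confs, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    # Per bin: [sum of confidences, number correct, number of samples]
    totals = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        b = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        totals[b][0] += c
        totals[b][1] += float(y)
        totals[b][2] += 1
    n = len(confs)
    return sum(
        cnt / n * abs(acc / cnt - conf / cnt)
        for conf, acc, cnt in totals
        if cnt
    )


if __name__ == "__main__":
    # Made-up phrases and correctness judgments, purely to exercise the code.
    phrases = ["Highly likely", "Likely", "Highly unlikely", "Unlikely"]
    correct = [True, True, False, True]
    confs = [expression_to_float(p) for p in phrases]
    print("ECE:", expected_calibration_error(confs, correct))
    print("Brier:", brier_score(confs, correct))
```

Because each expression collapses to a single midpoint value, a finetuned model's confidences take only as many distinct values as there are phrases, so the calibration chart has at most that many populated bins.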