SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models

Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, Kamalika Das

TL;DR

This work introduces a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties in large language models, and shows a substantial improvement in model uncertainty calibration.

Abstract

In recent years, large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities. However, a pressing challenge is their tendency to make confidently wrong predictions, highlighting the critical need for uncertainty quantification (UQ) in LLMs. While previous works have mainly focused on addressing aleatoric uncertainty, the full spectrum of uncertainties, including epistemic, remains inadequately explored. Motivated by this gap, we introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties. The method entails generating a set of perturbations for LLM inputs, sampling outputs for each perturbation, and incorporating an aggregation module that generalizes the sampling uncertainty approach for text generation tasks. Through extensive experiments on various datasets, we investigated different perturbation and aggregation techniques. Our findings show a substantial improvement in model uncertainty calibration, with a reduction in Expected Calibration Error (ECE) by 50% on average. Our findings suggest that our proposed UQ method offers promising steps toward enhancing the reliability and trustworthiness of LLMs.
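
The three stages named in the abstract (perturb the input, sample an output per perturbation, aggregate) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the whitespace-style `DUMMY_SUFFIXES`, the `difflib`-based `similarity` stand-in, the `spuq_confidence` name, and the `llm` callable, which is a hypothetical function mapping a prompt string to an answer string.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# Hypothetical perturbations: short suffixes intended not to change the question.
DUMMY_SUFFIXES = [" ", "\n", " ...", "\t", " ?"]


def perturb(prompt: str, k: int) -> List[str]:
    """Return k lightly perturbed copies of the prompt (illustrative only)."""
    return [prompt + DUMMY_SUFFIXES[i % len(DUMMY_SUFFIXES)] for i in range(k)]


def similarity(a: str, b: str) -> float:
    """Stand-in text similarity in [0, 1]; an assumption, not the paper's metric."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def spuq_confidence(prompt: str, llm: Callable[[str], str], k: int = 5) -> float:
    """Aggregate agreement between the original answer and the answers
    obtained from k perturbed prompts into a single confidence score."""
    base_answer = llm(prompt)                            # answer to the original input
    answers = [llm(p) for p in perturb(prompt, k)]       # one sample per perturbation
    scores = [similarity(base_answer, a) for a in answers]
    return sum(scores) / len(scores)                     # mean agreement as confidence
```

With a model whose answer does not change under perturbation, the score is 1.0; the more the answers drift, the lower the score. A real setup would replace `similarity` with a task-appropriate text metric and `perturb` with the perturbation options the paper studies.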

Paper Structure (32 sections, 2 equations, 10 figures, 4 tables, 1 algorithm)

This paper contains 32 sections, 2 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of uncertainty quantification techniques: one-pass (lin2022teaching; kadavath2022language; chen1998evaluation), sampling-based (si2022prompting; wang2022self), and our SPUQ method. SPUQ addresses both epistemic (via perturbation) and aleatoric (via sampling) uncertainties. Aggregation yields the total uncertainty, distinguishing SPUQ from traditional methods focused mainly on aleatoric uncertainty.
  • Figure 2: Options associated with the perturbation module (Section \ref{sec-perturb-module}) and the aggregation module (Section \ref{sec-agg-module}) of the SPUQ method.
  • Figure 3: An overview of the uncertainty calibration performance, measured by the Expected Calibration Error (ECE), for various uncertainty calibration methods across five LLMs over four question-answering datasets. A lower ECE indicates better uncertainty calibration.
  • Figure 4: The dependence of the uncertainty calibration, measured by the average confidence-accuracy Pearson correlation, on the number of perturbed samples, $k$. The general trend indicates that calibration improves as $k$ increases, but it plateaus at approximately $k=5$.
  • Figure 5: The distribution of ECE changes for specific temperature perturbations, taking into account variations in other hyperparameters. Raising the sampling temperature above the base value $T_0=0.7$ tends to improve calibration (lower ECE).
  • ...and 5 more figures
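
Several captions above report calibration as Expected Calibration Error (ECE). As a point of reference, the sketch below computes ECE in its common binned form: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The choice of 10 equal-width bins and the function name `expected_calibration_error` are assumptions; the page does not state the binning used in the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that is always fully confident and always wrong scores 1.0, while one whose confidence matches its accuracy in every bin scores 0, which is why lower ECE indicates better calibration.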