Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency

Roman Vashurin, Maiya Goloburda, Albina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, Maxim Panov

TL;DR

Uncertainty quantification for LLMs faces challenges reconciling token-level confidence with semantic variability. The paper introduces CoCoA, an MBR-based framework that fuses model confidence with output consistency via a multiplicative risk, and CoCoA Light as a sample-efficient proxy. Extensive experiments across QA, summarization, and translation show substantial improvements over state-of-the-art UQ methods and across multiple open-weight LLMs, including Gemma. The work clarifies the theoretical link between risk-based uncertainty and MBR decoding while delivering practical gains in reliability for diverse generation tasks.

Abstract

Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Building on these findings, we propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.

Paper Structure

This paper contains 41 sections, 29 equations, 3 figures, 21 tables.

Figures (3)

  • Figure 1: Example of inconsistent probabilities assigned to semantically identical answers by an LLM, demonstrating the limitation of relying solely on sequence-level information.
  • Figure 2: Illustration of our method: the LLM generates a response, evaluates the similarity to alternatives, computes the confidence, and finally combines the confidence with the similarity measure. High similarity to alternatives reduces the uncertainty, while low similarity keeps it high.
  • Figure 3: Prediction-Rejection Ratio (PRR) Curve illustrating the quality of the non-rejected predictions as a function of the rejection rate. Oracle represents the optimal rejection strategy, Random is a random rejection, and UQ is rejection based on the evaluated uncertainty quantification method.
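
The pipeline described in the Figure 2 caption (generate a response, compare it to sampled alternatives, compute confidence, combine the two) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the paper's implementation: the confidence term is taken to be the negative log-likelihood of the response, the similarity function is a toy token-overlap (Jaccard) score standing in for a semantic similarity model, and the two terms are multiplied, following the "multiplicative risk" wording of the TL;DR. All function names are hypothetical.

```python
def sequence_uncertainty(token_logprobs):
    """Information-based term (assumed): negative log-likelihood of the response."""
    return -sum(token_logprobs)


def jaccard_similarity(a, b):
    """Toy stand-in for a semantic similarity model (hypothetical choice)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_uncertainty(response, token_logprobs, alternatives,
                         similarity=jaccard_similarity):
    """Multiply the confidence-based term by the mean dissimilarity to alternatives.

    High similarity to the sampled alternatives shrinks the score,
    while low similarity leaves the confidence-based term mostly intact.
    """
    if not alternatives:
        raise ValueError("at least one sampled alternative is required")
    info_term = sequence_uncertainty(token_logprobs)
    inconsistency = sum(
        1.0 - similarity(response, alt) for alt in alternatives
    ) / len(alternatives)
    return info_term * inconsistency


# Same response and token log-probabilities, different sampled alternatives.
logprobs = [-0.2, -0.1]
agreeing = combined_uncertainty("Paris France", logprobs,
                                ["paris france", "Paris France"])
disagreeing = combined_uncertainty("Paris France", logprobs,
                                   ["Lyon", "Marseille"])
print(agreeing, disagreeing)
```

In this sketch, a response whose alternatives all agree with it gets a lower score than the same response with disagreeing alternatives, which matches the behaviour the caption describes. A real setup would replace the Jaccard score with a learned semantic similarity measure.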