MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs

Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, Salman Avestimehr

TL;DR

This work addresses the reliability of generative LLM outputs by improving uncertainty estimation (UE) with Meaning-Aware Response Scoring (MARS). MARS replaces length-normalized scoring by weighting token probabilities according to their semantic contribution to the answer, using a convex combination of length and meaning via $w(\cdot)=\frac{1}{2L}+\frac{u(\cdot)}{2}$ and a BERT-like model to estimate token importance. The authors implement a compact 110M-parameter model that detects phrase boundaries and assigns phrase-level importance in a single pass, achieving universal UE improvements across multiple QA datasets and models with modest overhead. They demonstrate robust AUROC gains for standard UE baselines (Confidence, Entropy, SE) and provide thorough ablations, hyperparameter analyses, and a medical-domain evaluation to underline the practical impact for trustworthy LLM applications. Overall, MARS offers a principled, scalable enhancement to UE in auto-regressive LLMs, facilitating safer deployment in high-stakes settings.
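The weighting described above can be illustrated with a minimal sketch. It assumes that the weight $w$ is applied as an exponent on each token probability (by analogy with length-normalized scoring, which is a geometric mean) and that the importance values $u$ sum to 1 over the response; both are assumptions of this sketch, and the function names and the hand-picked importance values are hypothetical. In MARS the importance would come from the BERT-like model, not from a fixed list.

```python
import math


def length_normalized_score(token_probs):
    """Geometric mean of token probabilities: prod(p_l ** (1 / L))."""
    L = len(token_probs)
    return math.exp(sum(math.log(p) for p in token_probs) / L)


def mars_style_score(token_probs, importance):
    """Weighted product of token probabilities: prod(p_l ** w_l).

    w_l = 1 / (2L) + u_l / 2 is the convex combination of length and meaning
    from the summary. Using w_l as an exponent is an assumption of this sketch.
    """
    L = len(token_probs)
    if len(importance) != L:
        raise ValueError("need one importance value per token")
    weights = [1.0 / (2 * L) + u / 2.0 for u in importance]
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, token_probs)))


# Hypothetical three-token response in which the middle token carries most
# of the meaning and is also the token the model is least sure about.
probs = [0.9, 0.2, 0.8]
importance = [0.1, 0.8, 0.1]  # illustrative values, assumed to sum to 1

baseline = length_normalized_score(probs)
weighted = mars_style_score(probs, importance)
```

Under these assumptions the weights sum to 1, and uniform importance ($u = 1/L$ for every token) recovers length-normalized scoring exactly. In the example, the low-probability token is also the most important one, so the meaning-aware score comes out lower than the length-normalized one.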

Abstract

Generative Large Language Models (LLMs) are widely utilized for their excellence in various tasks. However, their tendency to produce inaccurate or misleading outputs poses a potential risk, particularly in high-stakes environments. Therefore, estimating the correctness of generative LLM outputs is an important task for enhanced reliability. Uncertainty Estimation (UE) in generative LLMs is an evolving domain, where SOTA probability-based methods commonly employ length-normalized scoring. In this work, we propose Meaning-Aware Response Scoring (MARS) as an alternative to length-normalized scoring for UE methods. MARS is a novel scoring function that considers the semantic contribution of each token in the generated sequence in the context of the question. We demonstrate that integrating MARS into UE methods results in a universal and significant improvement in UE performance. We conduct experiments using three distinct closed-book question-answering datasets across five popular pre-trained LLMs. Lastly, we validate the efficacy of MARS on a Medical QA dataset. Code can be found at https://github.com/Ybakman/LLM_Uncertainity.

Paper Structure (37 sections, 13 equations, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of Meaning-Aware Response Scoring (MARS). Each token in the response of a generative LLM is assigned a weight based on its importance in the meaning. The product of the weighted probabilities of these tokens yields the response score. MARS is then used for Uncertainty Estimation (UE) methods in generative LLMs.
  • Figure 2: The most common probability-based UE methods for generative LLMs. The aim is to calculate the uncertainty of the most probable answer (shown in darker green) to the given question. Length-normalized scoring is used in all these methods to obtain output scores. We propose MARS to replace it in these schemes.
  • Figure 3: Our BERT-like transformer model takes the question and the generated answer as inputs, and outputs phrases in the generated answer and corresponding importance coefficients.
  • Figure 4: AUROC scores for various temperatures and sampling numbers.