Random-Set Large Language Models

Muhammad Mubashar, Shireen Kudukkil Manchingal, Fabio Cuzzolin

TL;DR

The paper tackles the challenge of trust and uncertainty in large language models by introducing Random-set Large Language Models (RS-LLMs) that predict belief functions over token sets, rather than conventional probability vectors. A budgeting mechanism based on hierarchical clustering limits focal sets to a tractable budget, while training aligns belief functions with a BCE loss and mass-regularization terms; the pignistic distribution derived from the belief function guides next-token selection. RS-LLMs quantify epistemic uncertainty through credal-set width and BetP entropy, enabling assessment of prediction confidence and detection of hallucinations. Empirical results on CoQA and OBQA using multiple base models show improved accuracy and informative uncertainty signals, suggesting RS-LLMs can enhance reliability and interpretability in practical NLP tasks.
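To make the quantities named above concrete, here is a minimal sketch using the standard Dempster-Shafer definitions of belief, plausibility, and the pignistic transform. The toy vocabulary, the mass values, and the per-token width `Pl - Bel` are illustrative assumptions; the paper's exact uncertainty measures may be defined differently.

```python
import math

def belief(mass, subset):
    """Bel(A): total mass of the focal sets contained in A."""
    subset = frozenset(subset)
    return sum(m for focal, m in mass.items() if focal <= subset)

def plausibility(mass, subset):
    """Pl(A): total mass of the focal sets that intersect A."""
    subset = frozenset(subset)
    return sum(m for focal, m in mass.items() if focal & subset)

def pignistic(mass):
    """BetP(x): each focal set shares its mass equally among its tokens."""
    betp = {}
    for focal, m in mass.items():
        for token in focal:
            betp[token] = betp.get(token, 0.0) + m / len(focal)
    return betp

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical mass function over a 3-token vocabulary (assumed values).
mass = {
    frozenset({"cat"}): 0.5,
    frozenset({"dog"}): 0.2,
    frozenset({"cat", "dog", "car"}): 0.3,  # mass left on a non-singleton set
}

betp = pignistic(mass)
next_token = max(betp, key=betp.get)  # greedy choice from BetP
# Per-token gap between upper and lower probability (assumed width measure).
width = {t: plausibility(mass, {t}) - belief(mass, {t}) for t in betp}
betp_entropy = entropy(betp)
```

In this toy case the mass on the three-token set is what widens the interval for every token: the more mass the model places on non-singleton sets, the larger the gap between lower and upper probability.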

Abstract

Large Language Models (LLMs) are known to produce very high-quality texts and responses to our queries. But how much can we trust this generated text? In this paper, we study the problem of uncertainty quantification in LLMs. We propose a novel Random-Set Large Language Model (RS-LLM) approach which predicts finite random sets (belief functions) over the token space, rather than probability vectors as in classical LLMs. To do so efficiently, we also present a methodology based on hierarchical clustering to extract and use a budget of "focal" subsets of tokens upon which the belief prediction is defined, rather than using all possible collections of tokens, making the method scalable yet effective. RS-LLMs encode the epistemic uncertainty induced in their generation process by the size and diversity of their training set via the size of the credal sets associated with the predicted belief functions. The proposed approach is evaluated on the CoQA and OBQA datasets using Llama2-7b, Mistral-7b and Phi-2 models and is shown to outperform the standard model on both datasets in terms of answer correctness, while also showing potential in estimating the second-level uncertainty in its predictions and providing the capability to detect when it is hallucinating.
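The budgeting idea can be illustrated with a small agglomerative-clustering sketch. This is not the authors' procedure: the average-linkage criterion, the rule of keeping the first `budget` merged clusters, the inclusion of all singletons, and the 2-D toy embeddings are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, emb):
    """Mean pairwise distance between the tokens of two clusters."""
    total = sum(euclidean(emb[a], emb[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def budget_focal_sets(emb, budget):
    """Merge the closest clusters bottom-up; the first `budget` merged
    clusters are kept as non-singleton focal sets (assumed selection rule)."""
    clusters = [frozenset({t}) for t in emb]
    focal = []
    while len(clusters) > 1 and len(focal) < budget:
        pairs = [
            (average_linkage(a, b, emb), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        ]
        _, i, j = min(pairs)  # closest pair of clusters
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        focal.append(merged)
    # Singletons are always included here so that precise predictions remain possible.
    return [frozenset({t}) for t in emb] + focal

# Hypothetical 2-D token embeddings (real models use high-dimensional ones).
embeddings = {
    "cat": (0.0, 0.0),
    "dog": (0.1, 0.0),
    "car": (5.0, 5.0),
    "bus": (5.0, 5.2),
}
focal_sets = budget_focal_sets(embeddings, budget=2)
```

With a vocabulary of N tokens there are 2^N candidate subsets, so a fixed budget of this kind is what keeps the belief prediction tractable. The naive pairwise search above is only suitable for toy inputs.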

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Training and generation flow of RS-LLM. Training is performed in parallel using teacher forcing, whereas generation is sequential. For each token, the model predicts a belief function. The mass function, the probability distribution and the next token are then computed or sampled from that belief function.
  • Figure 2: A belief function measures the total belief (sum of masses of its subsets) for a set (Eq. \ref{eq:belief-mobius}).
  • Figure 3: Proposed budgeting method for RS-LLM. First, embeddings are computed for all the tokens in vocabulary. Then, focal sets are computed using hierarchical clustering.
  • Figure 4: Training examples from CoQA and OBQA datasets. The text in black highlights the actual question, while the blue text represents prompt instructions. The model is trained to predict the text in green.
  • Figure 5: Behavior of the uncertainty measures of Llama2 and RS-Llama2 with respect to correctness and closeness to the ground truth on the CoQA and OBQA datasets.
  • ...and 1 more figures
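
For reference, the relation described in the Figure 2 caption, together with the inversion that recovers masses from a predicted belief function (the step implied by Figure 1), can be written in standard Dempster-Shafer notation. The symbols $\Theta$ (token space) and $m$ (mass function) are assumed here and may differ from the paper's own notation.

```latex
\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B),
\qquad
m(A) = \sum_{B \subseteq A} (-1)^{|A \setminus B|}\, \mathrm{Bel}(B),
\qquad A \subseteq \Theta .
```

The second identity is the Möbius inversion of the first, so a belief function and its mass function carry the same information.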