Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation

Zhen Lin, Shubhendu Trivedi, Jimeng Sun

TL;DR

This work introduces Contextualized Sequence Likelihood (CSL), a confidence measure for natural language generation that reweights token logits using attention weights elicited by prompts targeted at the model's own output. By selecting a small, stable set of attention heads, CSL emphasizes the tokens most relevant to the task context, producing a refined confidence score $C_{CSL} = \sum_{i=1}^{n} w_i l_i$ with $w_i = a_i / \sum_{j=1}^{n} a_j$, where $l_i$ is the logit of the $i$-th generated token and $a_i$ is the attention weight elicited on it. Across QA datasets (CoQA, TriviaQA, Natural Questions) and multiple open LLMs (e.g., LLaMA2-70B, Mistral-7B, Gemma-7B), CSL outperforms standard sequence likelihood and other baselines in AUROC and AUARC, and it improves uncertainty measures when integrated with Semantic Entropy (SE+CSL). The study also shows that the CSL-Next variant yields similar improvements and that a small set of heads generalizes across datasets, indicating that the attention signals capture meaningful, task-relevant concepts. Limitations include the limited interpretability of attention, the reliance on task-specific prompting, and the lack of external fact-checking, which suggests future work on token-explanation prompts and cross-model calibration.
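
The weighted sum above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, $l_i$ is assumed to be a per-token log-probability, and the attention values are assumed to be already extracted from the selected heads.

```python
def sequence_likelihood(token_logprobs):
    """Unweighted baseline: plain sum of per-token log-probabilities."""
    return sum(token_logprobs)


def contextualized_sequence_likelihood(token_logprobs, attention):
    """Compute C_CSL = sum_i w_i * l_i with w_i = a_i / sum_j a_j.

    token_logprobs: l_i for each generated token (assumed log-probabilities).
    attention: non-negative a_i for each generated token (assumed to come
               from the attention-eliciting prompt; extraction not shown).
    """
    if len(token_logprobs) != len(attention):
        raise ValueError("need one attention value per token")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention values must sum to a positive number")
    weights = [a / total for a in attention]  # normalized weights w_i
    return sum(w * l for w, l in zip(weights, token_logprobs))


# Illustrative numbers only: the second token is unlikely but receives
# little attention, so it contributes less to the weighted score.
logprobs = [-0.1, -2.0, -0.3]
attn = [0.1, 0.1, 0.8]
csl = contextualized_sequence_likelihood(logprobs, attn)
uniform = contextualized_sequence_likelihood(logprobs, [1.0, 1.0, 1.0])
```

With uniform attention the weights are all $1/n$, so the score reduces to the length-normalized sequence likelihood; non-uniform attention shifts the score toward the tokens the model attends to.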

Abstract

The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
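
The validation-based head selection mentioned in the abstract could look roughly like the following sketch. It assumes that a per-head confidence score has already been computed for every validation sample and that each sample has a binary correctness label; `auroc` and `select_top_heads` are hypothetical helper names, and the head identifiers and numbers are made up for illustration.

```python
def auroc(scores, labels):
    """AUROC as the probability that a correct sample outscores an
    incorrect one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def select_top_heads(val_scores_by_head, val_labels, k):
    """Rank heads by validation AUROC and keep the k best."""
    ranked = sorted(
        val_scores_by_head,
        key=lambda head: auroc(val_scores_by_head[head], val_labels),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data: keys are (layer, head) pairs, values are per-sample
# confidence scores obtained with that head's attention weights.
val_labels = [1, 1, 0, 0]
val_scores_by_head = {
    (0, 3): [-0.2, -0.3, -1.0, -0.9],
    (5, 1): [-0.5, -1.2, -0.4, -0.9],
    (7, 2): [-0.2, -0.95, -1.0, -0.9],
}
top_heads = select_top_heads(val_scores_by_head, val_labels, k=2)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch; a library routine would be preferable on a real validation set.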

Paper Structure

This paper contains 24 sections, 5 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: The attention-eliciting prompt used in this paper (full version deferred to the Appendix due to space constraints). $optional_context, $question and $response are replaced with the corresponding values of a sample. In our experiments, $optional_context refers to the story and conversation history that accompany each question in CoQA (Reddy et al., 2019).
  • Figure 2: Depending on the question, the attention-eliciting prompt introduced in Figure 1 induces attention that focuses on different parts of the same response ("when", "who" and "what"). In the plot on the right, we show the CDF of $\Delta_{attn}$, the change of attention weight on the corresponding concept when asked the relevant question, on all 1,024 heads of Mistral-7B. For example, for "when", we compute $\Delta_{attn}$ as the attention weight on "On July 20, 1969" when asked the "when" question minus the average of the cases where the other two questions were asked. In all cases, the attention significantly increases on the relevant tokens (the p-value from a one-sided t-test is at most 9e-90).
  • Figure 3: Scatter plot of test vs. validation AUROC for confidence measures computed via the main CSL equation with different heads' attention weights, on Natural Questions (nq) with the Mistral-7B model. The ranking is highly consistent: the best heads on the validation set continue to perform well on the test set. In this case, the Spearman correlation (Spearman, 1961) is $>97\%$. We can thus pick only a small subset of the 1,024 heads (or more for other LMs) to construct the final confidence measure $C_{CSL}$.
  • Figure 4: Histogram of the correlation between attentions from CSL and CSL-Next (top 10 heads' average). We keep only generations with more than 2 tokens. For most responses, the chosen heads' attentions are highly correlated, suggesting that both methods focus on the same tokens, as exemplified in the paper's case-study figure.
  • Figure 5: Tokens whose attention is increased are marked (the others are decreased). As expected, such re-weighting is not always interpretable, but it generally helps locate the more relevant tokens.
  • ...and 3 more figures
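
The $\Delta_{attn}$ quantity described in the Figure 2 caption can be written out for a single head as the short sketch below. The function name and the attention values are illustrative assumptions; the paper computes this quantity over all 1,024 heads of Mistral-7B, which is not reproduced here.

```python
def delta_attn(attn_by_question, relevant):
    """Attention on a concept under the relevant question, minus the
    average attention on it under the other questions.

    attn_by_question: maps a question type ("when", "who", "what") to the
                      attention weight on the concept's tokens when that
                      question is asked.
    relevant: the question type that actually targets the concept.
    """
    others = [v for q, v in attn_by_question.items() if q != relevant]
    if not others:
        raise ValueError("need at least one other question for comparison")
    return attn_by_question[relevant] - sum(others) / len(others)


# Hypothetical attention weights on the span "On July 20, 1969" for one head.
attn_on_date = {"when": 0.6, "who": 0.2, "what": 0.1}
delta = delta_attn(attn_on_date, relevant="when")
```

A positive value means the head attends more to the concept when the matching question is asked, which is the pattern the caption reports across heads.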