Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
Zhen Lin, Shubhendu Trivedi, Jimeng Sun
TL;DR
This work introduces Contextualized Sequence Likelihood (CSL), a confidence measure for natural language generation that reweights token logits using attention weights elicited by prompting the model about its own output. By selecting a small, stable set of attention heads, CSL emphasizes the tokens most relevant to the task context, producing a refined confidence score $C_{CSL} = \sum_{i=1}^n w_i l_i$ with $w_i = a_i / \sum_{j=1}^n a_j$, where $l_i$ is the logit of the $i$-th generated token and $a_i$ is the attention weight it receives. Across QA datasets (CoQA, TriviaQA, Natural Questions) and multiple open LLMs (e.g., LLaMA2-70B, Mistral-7B, Gemma-7B), CSL outperforms standard sequence likelihood and other baselines in AUROC and AUARC, and it improves uncertainty measures when integrated with Semantic Entropy (SE+CSL). The study also shows that the CSL-Next variant yields similar improvements and that a small set of heads generalizes across datasets, indicating that the attention signals capture meaningful, task-relevant concepts. Limitations include the limited interpretability of attention, the reliance on task-specific prompting, and the lack of external fact-checking, suggesting avenues for future work in token-explanation prompts and cross-model calibration.
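The weighted sum above can be illustrated with a minimal Python sketch. It assumes that each $l_i$ is a per-token log-probability and that one non-negative attention value $a_i$ per generated token has already been extracted from the model. The function name and the toy numbers are hypothetical and not taken from the paper.

```python
def contextualized_sequence_likelihood(token_logprobs, attention):
    """Compute C_CSL = sum_i w_i * l_i with w_i = a_i / sum_j a_j.

    token_logprobs: per-token values l_i (assumed to be log-probabilities).
    attention: non-negative attention values a_i, one per generated token.
    """
    if len(token_logprobs) != len(attention):
        raise ValueError("need exactly one attention value per token")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention values must have a positive sum")
    # Normalize attention into weights and take the weighted sum.
    return sum((a / total) * l for a, l in zip(attention, token_logprobs))


# Toy example (made-up numbers): the second token is unlikely, but it
# receives little attention, so it contributes little to the score.
logprobs = [-0.1, -2.0, -0.3]
csl_score = contextualized_sequence_likelihood(logprobs, [0.1, 0.1, 0.8])
uniform_score = contextualized_sequence_likelihood(logprobs, [1.0, 1.0, 1.0])
print(csl_score, uniform_score)
```

With uniform attention the weights reduce to $1/n$, so the score equals the average of the $l_i$. In the toy example this is about -0.8, while the attention-weighted score is about -0.45.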
Abstract
The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
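The abstract's idea of using a validation set to identify relevant attention heads could be sketched as follows. The abstract does not state the selection criterion, so this sketch makes an assumption: each head is scored by the AUROC of the confidence it induces on validation examples with known correctness labels, and the top `k` heads are kept. All function names and the toy data are hypothetical.

```python
def weighted_likelihood(token_logprobs, attention):
    """Attention-weighted sum of token log-probabilities for one answer."""
    total = sum(attention)
    return sum((a / total) * l for a, l in zip(attention, token_logprobs))


def auroc(scores, labels):
    """AUROC as the chance that a correct answer outscores an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_heads(val_logprobs, val_attention_by_head, val_correct, k):
    """Rank heads by validation AUROC (an assumed criterion); keep the top k.

    val_attention_by_head[h][j] holds head h's attention values over the
    tokens of validation example j.
    """
    ranked = []
    for head, per_example_attention in enumerate(val_attention_by_head):
        confidences = [weighted_likelihood(lp, a)
                       for lp, a in zip(val_logprobs, per_example_attention)]
        ranked.append((auroc(confidences, val_correct), head))
    # Highest AUROC first; ties are broken by the lower head index.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [head for _, head in ranked[:k]]


# Toy validation set (made-up numbers): four two-token answers.
val_logprobs = [[-0.1, -3.0], [-0.2, -2.5], [-1.5, -0.1], [-2.0, -0.2]]
val_correct = [True, True, False, False]
val_attention_by_head = [
    [[0.9, 0.1]] * 4,  # head 0 focuses on the first token
    [[0.1, 0.9]] * 4,  # head 1 focuses on the second token
]
best_heads = select_heads(val_logprobs, val_attention_by_head, val_correct, k=1)
print(best_heads)
```

In this toy data the first token separates correct from incorrect answers, so the head attending to it ranks first. A real pipeline would read the attention values from the base LLM rather than hard-coding them.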
