Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values

Krystian Zawistowski

TL;DR

This work investigates decoding strategies for generative LLMs, proposing a high-entropy expected-value decoding method that leverages residual information in the next-token distribution. By comparing the expected-value score $E(s)$ to greedy decoding across multiple models and datasets, the authors show that raising the entropy of the score distribution (e.g., with temperature $T=10$) improves alignment with human judgments on SummEval, with notable gains for Mixtral variants and viable performance for 4-bit quantized models. They also introduce a tree-based probability analysis (tree-crawling topP) to explore the space of probable completions and reveal phenomena such as positional bias and stopping instability. Together, these results support rethinking standard decoding practices to improve reading-comprehension-related metrics, and they provide a practical tool for analyzing decoding and reliability in LLM applications. The work also points to practical implications for deploying robust, controllable LLM systems, including RAG pipelines with quantized models.

Abstract

LLM text decoding is a key component of perceived LLM quality. We present two experiments showing that decoding methods can be improved by manipulating token probabilities. First, we test several LLMs on the SummEval summary scoring dataset to measure reading comprehension. We compare scores from greedy decoding to expected values over the next-token distribution. We scale logits by a large temperature to increase the entropy of the scores. This yields a strong improvement on SummEval (in terms of correlation with human judgement): from 6-8% to 13-28% for 7B Mistral and from 20%-46% to 37%-56% for Mixtral, beating the GPT 4 0314 result on two metrics. Part of the gain seems related to positional bias. Second, we use a probability-based tree sampling algorithm to examine all of the most probable generations for a given prompt.
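The idea of examining all of the most probable generations can be illustrated with a small sketch. This is not the paper's tree-crawling topP algorithm, whose details are not given here; it is a plain depth-first enumeration that keeps every branch whose cumulative probability stays above a threshold. The names `enumerate_completions`, `toy_model`, `min_prob`, and `max_len`, as well as the toy probabilities, are hypothetical and chosen only for illustration.

```python
def enumerate_completions(next_token_probs, prefix=(), prob=1.0,
                          min_prob=0.08, max_len=5, eos="<eos>"):
    """Return (tokens, probability) for every finished completion whose
    cumulative probability is at least min_prob.

    next_token_probs is an assumed callable: it takes a tuple of tokens
    and returns a dict {token: probability} for the next position.
    """
    # A branch is finished when it emits the end token or hits the length cap.
    if prefix and (prefix[-1] == eos or len(prefix) >= max_len):
        return [(prefix, prob)]
    results = []
    for token, p in next_token_probs(prefix).items():
        # Prune branches whose cumulative probability falls below the threshold.
        if prob * p >= min_prob:
            results.extend(enumerate_completions(
                next_token_probs, prefix + (token,), prob * p,
                min_prob, max_len, eos))
    return results


def toy_model(prefix):
    """Stand-in for an LLM: a score token first, then (mostly) the end token."""
    if not prefix:
        return {"3": 0.5, "4": 0.4, "2": 0.1}
    return {"<eos>": 0.9, ".": 0.1}


completions = sorted(enumerate_completions(toy_model),
                     key=lambda item: item[1], reverse=True)
for tokens, p in completions:
    print(" ".join(tokens), round(p, 3))
```

With a real model, `next_token_probs` would wrap a forward pass that returns the softmax over the vocabulary, typically truncated to the top few tokens to keep the tree tractable.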

Paper Structure (8 sections, 5 equations, 1 figure, 7 tables)

This paper contains 8 sections, 5 equations, 1 figure, 7 tables.

Introduction
Summary evaluation with expected value decoding.
Expected value decoding.
Results
Positional bias.
Statistical analysis.
Tree-based sampling
Conclusions

Figures (1)

Figure 1: Conceptual diagram of the presented approach: instead of answering with the most probable token, we calculate the expected value at temperature $T=10$ to utilize residual information in the next-token distribution.
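The contrast in the caption can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: it assumes the model's raw logits for the candidate score tokens (here scores 1 to 5) are already available, and it renormalizes over those tokens only, which is an assumption of this sketch. The function names and the example logits are hypothetical.

```python
import math


def expected_score(score_logits, temperature=10.0):
    """Expected value of the score under a temperature-scaled softmax.

    score_logits maps each candidate score (e.g. 1..5) to the raw logit
    of its token. Restricting the softmax to these tokens is an
    assumption made for this sketch.
    """
    scaled = {s: logit / temperature for s, logit in score_logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {s: math.exp(v - top) for s, v in scaled.items()}
    total = sum(weights.values())
    return sum(s * w / total for s, w in weights.items())


def greedy_score(score_logits):
    """Greedy decoding: return the score whose token has the largest logit."""
    return max(score_logits, key=score_logits.get)


# Made-up logits: scores 3 and 4 are close, so greedy decoding discards
# the information that 4 was nearly as likely as 3.
example_logits = {1: -2.0, 2: 0.5, 3: 4.0, 4: 3.5, 5: -1.0}
greedy = greedy_score(example_logits)
expected = expected_score(example_logits, temperature=10.0)
print(greedy, expected)
```

Greedy decoding returns a single integer, while the expected value is a continuous score that shifts with the probability mass on the neighbouring tokens; a larger temperature flattens the distribution so that low-probability tokens contribute more.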
