Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation

Yuxuan Zhou, Margret Keuper, Mario Fritz

TL;DR

The paper tackles the difficulty of evaluating truncation-based sampling for open-ended text generation, where parameter tuning can mask a method's true capacity. It introduces a Context-Preserving Trie (CP-Trie) to estimate the data support at each prefix and defines probability-independent metrics, $\text{Recall}$ and $\text{Risk}$, together with a tuning-independent framework that uses $\text{AR}_{\text{Risk}}$ and $\text{RSE}_{\text{Risk}}$ to compare methods across models. Large-scale experiments over multiple LLM families show that adaptive methods (e.g., Adaptive sampling and Mirostat) typically improve diversity with controlled risk, while fixed Top-k and Top-p offer different trade-offs; validation on TruthfulQA supports the relevance of the proposed metrics. The authors provide a practical guideline and a public CP-Trie benchmark for practitioners, emphasizing that evaluation should be decoupled from hyperparameter tuning and should reflect real-world data supports and model capacities.

Abstract

Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, targeting a balance between diversity and quality via temperature tuning and tail truncation. Considering the strong dependency of the candidate next tokens on different prefixes, recent studies propose to adaptively truncate the tail of LLMs' predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated parameters and the limited exemplar text. In this paper, we propose a systematic way to estimate the capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree which preserves the context of a full sentence. Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection.

Paper Structure (22 sections, 4 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: N-gram models tend to overestimate the data support size given a prefix (marked by a red line) due to limited window size (marked with a blue window).
  • Figure 2: Histogram of the estimated optimal truncation values for gpt2-xl, which achieve exactly full recall of data support given different prefixes.
  • Figure 3: Illustration of the EnWiki CP-Trie. For brevity, only two child nodes are shown at each depth. The number at the left side of the slash symbol refers to the branching factor at the current node, and the number at the right side refers to the total number of leaves of the sub-tree with the current node as the root node.
  • Figure 4: The total number of leaves on the CP-Trie against the total number of processed articles.
  • Figure 5: Comparing the average Recalls at given average Risks using different model sizes.
  • ...and 3 more figures
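
To make the Figure 3 caption concrete, here is a minimal sketch of a token-level prefix trie that reports the two numbers shown at each node: the branching factor and the number of leaves in the sub-tree. It is an illustration only, not the authors' CP-Trie implementation. The class and method names (`PrefixTrie`, `next_tokens`, `branching_factor`, `leaf_count`) are hypothetical, and whitespace splitting stands in for whatever tokenizer the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps a token to the child node reached by appending that token.
    children: dict = field(default_factory=dict)


class PrefixTrie:
    """Illustrative prefix trie over token sequences (hypothetical names)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        # Walk down from the root, creating nodes for unseen tokens.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def _find(self, prefix):
        # Return the node reached by following `prefix`, or None if absent.
        node = self.root
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def next_tokens(self, prefix):
        # Tokens observed directly after `prefix` in the inserted data.
        node = self._find(prefix)
        return set() if node is None else set(node.children)

    def branching_factor(self, prefix):
        # Number of distinct observed continuations of `prefix`.
        return len(self.next_tokens(prefix))

    def leaf_count(self, prefix):
        # Number of childless nodes in the sub-tree rooted at `prefix`.
        node = self._find(prefix)
        if node is None:
            return 0
        count, stack = 0, [node]
        while stack:
            cur = stack.pop()
            if not cur.children:
                count += 1
            stack.extend(cur.children.values())
        return count


# Toy corpus; whitespace tokenization is an assumption for illustration.
trie = PrefixTrie()
for sentence in ["the cat sat", "the cat ran", "the dog sat", "a bird flew"]:
    trie.insert(sentence.split())

print(trie.branching_factor(["the"]), "/", trie.leaf_count(["the"]))
```

Under this sketch, the set returned by `next_tokens` plays the role of an observed next-token set for a prefix. How the paper derives its data-support estimate and metrics from the trie is not specified in this summary, so that step is left out.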

Theorems & Definitions (4)

  • Definition 3.1
  • Definition 3.2
  • Definition 4.1
  • Definition 4.2