
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation

Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher

TL;DR

This paper addresses how decoding hyperparameters shape open-ended text generation across multiple open-source LLMs and domains. It conducts a large-scale, grid-like evaluation of deterministic, sampling-based, and contrastive decoding strategies, using automatic metrics (Diversity, MAUVE, Coherence) and human judgments to assess quality. Key findings show no single method achieves human-level quality across all metrics; adaptive contrastive search and high-temperature or high-$p$ sampling often approach human-like performance, while deterministic approaches underperform for open-ended tasks. The work provides practical tuning guidelines and releases a 2.2 million-continuation dataset to support reproducibility and meta-analytic research, underscoring that hyperparameter choice can rival model size in determining text quality.

Abstract

Decoding strategies for generative large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Guided by specific hyperparameters, these strategies aim to transform the raw probability distributions produced by language models into coherent, fluent text. In this study, we undertake a large-scale empirical assessment of a range of decoding methods, open-source LLMs, textual domains, and evaluation protocols to determine how hyperparameter choices shape the outputs. Our experiments include both factual (e.g., news) and creative (e.g., fiction) domains, and incorporate a broad suite of automatic evaluation metrics alongside human judgments. Through extensive sensitivity analyses, we distill practical recommendations for selecting and tuning hyperparameters, noting that optimal configurations vary across models and tasks. By synthesizing these insights, this study provides actionable guidance for refining decoding strategies, enabling researchers and practitioners to achieve higher-quality, more reliable, and context-appropriate text generation outcomes.
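
To make the role of decoding hyperparameters concrete, the following is a minimal sketch of how temperature and the nucleus threshold $p$ reshape a next-token distribution before sampling. It is not the authors' code: the function names (`apply_temperature`, `nucleus_filter`, `sample_token`) and the toy logits are assumptions for illustration, and the logic follows the commonly used definitions of temperature scaling and nucleus (top-$p$) sampling.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (hypothetical helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative
    mass reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, p=0.9, rng=random):
    """Draw one token index after temperature scaling and nucleus filtering."""
    probs = nucleus_filter(apply_temperature(logits, temperature), p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy logits for a four-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.0, -1.0]
narrow = nucleus_filter(apply_temperature(toy_logits, 1.0), 0.5)
wide = nucleus_filter(apply_temperature(toy_logits, 1.0), 0.95)
```

With these toy logits, a low threshold such as `p=0.5` leaves only the single most probable token, so sampling behaves like greedy decoding, while `p=0.95` keeps three of the four tokens. Raising the temperature flattens the distribution before filtering, which lets more tokens into the nucleus at the same $p$.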

Paper Structure (38 sections, 4 equations, 26 figures, 6 tables)

Figures (26)

  • Figure 1: Influence of the nucleus sampling hyperparameter $p$ on the distribution of diversity and coherence metrics in text generated by Mistral 7B v0.3 (green). For comparison, the distribution of the same metrics in human-written text is displayed in blue.
  • Figure 2: Top five and bottom five decoding strategies, based on QText averages for each dataset. The highest-ranking strategies generally strike a balance between coherence and diversity, while the lowest-ranking strategies tend to overemphasize one at the expense of the other—such as beam search, which favors coherence, or contrastive search with $\alpha = 1.0$ and $k = 50$, which prioritizes diversity.
  • Figure 3: Distribution of metric values per model under beam search decoding.
  • Figure 4: Effect of beam width on metric behavior.
  • Figure 5: Distribution of metric values per model under contrastive search decoding.
  • ...and 21 more figures
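
The caption of Figure 2 notes that contrastive search with $\alpha = 1.0$ prioritizes diversity. A small sketch can show why, assuming the paper uses the standard contrastive search objective, in which each of the top-$k$ candidates is scored as $(1 - \alpha)$ times its model probability minus $\alpha$ times its maximum cosine similarity to the hidden states of the preceding tokens. The function names, the two-dimensional "hidden states", and the probabilities below are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pick(candidates, context_states, alpha):
    """Pick a token from the top-k candidates.

    candidates: list of (token, model_probability, hidden_state) tuples.
    context_states: hidden states of the tokens generated so far.
    """
    def score(candidate):
        _, prob, hidden = candidate
        # Degeneration penalty: similarity to the closest context token.
        penalty = max(cosine(hidden, ctx) for ctx in context_states)
        return (1 - alpha) * prob - alpha * penalty

    return max(candidates, key=score)[0]


# Hypothetical toy setup: one context token and three candidates.
context = [[1.0, 0.0]]
top_k = [
    ("the", 0.60, [1.0, 0.0]),    # likely, but identical to the context
    ("a", 0.35, [0.6, 0.8]),      # moderately likely, partly similar
    ("storm", 0.05, [0.0, 1.0]),  # unlikely, but dissimilar to the context
]

greedy_like = contrastive_pick(top_k, context, alpha=0.0)
balanced = contrastive_pick(top_k, context, alpha=0.2)
diversity_only = contrastive_pick(top_k, context, alpha=1.0)
```

In this toy case, `alpha=0.0` reduces to greedy selection and returns `"the"`, while `alpha=1.0` drops the probability term entirely and returns `"storm"` even though the model assigns it only 5% probability. That is the sense in which the extreme setting trades coherence for diversity; a larger $k$ widens the pool of low-probability candidates that can win this way.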