The Price of Format: Diversity Collapse in LLMs

Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang

TL;DR

The paper demonstrates that instruction-tuned LLMs using structured prompt templates exhibit a pronounced diversity collapse, defined by $D_{\text{template}} \ll D_{\text{simple}}$, across open-ended generation tasks. Through controlled prompt-ablation and decoding analyses over five models and nine tasks, it shows that structural cues in templates act as strong priors, anchoring outputs and reducing early-stage entropy even at high temperatures. Diversity can be recovered by removing formatting or using natural instructions, but task performance becomes uneven across domains, with structure-sensitive tasks benefiting from format consistency while knowledge-heavy tasks sometimes suffer. The work highlights practical tradeoffs between alignment and creativity and calls for diversity-aware prompt design and instruction tuning to preserve expressive variation without sacrificing downstream capabilities.

Abstract

Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model's output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
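
To make the comparison $D_{\text{template}} \ll D_{\text{simple}}$ concrete, the following is a minimal sketch of one way to score the diversity of a set of sampled outputs. It is not the paper's metric: the text above does not specify how $D$ is computed, so this sketch assumes a simple lexical proxy (mean pairwise Jaccard distance over word sets), and the sample strings and function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over the lowercase word sets of a and b."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def mean_pairwise_diversity(outputs: list[str]) -> float:
    """Average Jaccard distance over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0  # fewer than two outputs: no diversity to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical samples, invented for illustration (not data from the paper).
templated_outputs = [
    "the city council approved a new budget",
    "the city council approved a new park",
    "the city council approved a new budget plan",
]
simple_outputs = [
    "scientists discover a deep sea coral reef",
    "local bakery wins national bread award",
    "storm delays flights across the northern coast",
]

d_template = mean_pairwise_diversity(templated_outputs)
d_simple = mean_pairwise_diversity(simple_outputs)
print(f"D_template = {d_template:.3f}, D_simple = {d_simple:.3f}")
```

A word-overlap proxy only captures surface variation. The abstract describes outputs that are semantically similar, so an embedding-based distance would be a closer stand-in for semantic diversity; the pairwise-averaging structure would stay the same.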

Paper Structure

This paper contains 34 sections, 1 equation, 7 figures, 7 tables.

Figures (7)

  • Figure 1: News generation results under simple prompt (Left) and full chat template prompt (Right). Templated prompting significantly reduces topic diversity.
  • Figure 2: Semantic diversity comparison across Qwen3 model sizes under two prompting modes, excluding the thinking mode. The results show that diversity collapse occurs consistently across model scales.
  • Figure 3: Structural diversity across prompting modes in the news generation task, measured by the standard deviation of content word ratio (left), sentence count (middle), and token length (right).
  • Figure 4: Entropy of the output space across decoding steps with and without templates. The figure shows that using a template significantly reduces entropy, indicating a more constrained and predictable output distribution.
  • Figure 5: Performance comparison across prompting modes (Full Template, Fake Template, Minimum Dialog, and Simple Steer) for three instruction-tuned language models on three representative tasks.
  • ...and 2 more figures
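
The per-step entropy comparison described in the Figure 4 caption can be sketched as follows. This is an illustrative assumption about the computation rather than the authors' code: it takes the next-token probability distribution at each decoding step and reports its Shannon entropy in nats. The toy distributions are hypothetical and are not values from the paper.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_per_step(step_distributions: list[list[float]]) -> list[float]:
    """Entropy of the next-token distribution at each decoding step."""
    return [shannon_entropy(dist) for dist in step_distributions]


# Hypothetical next-token distributions over a 4-token vocabulary,
# one inner list per decoding step (invented for illustration).
with_template = [
    [0.97, 0.01, 0.01, 0.01],  # step 1: mass concentrated on one token
    [0.85, 0.05, 0.05, 0.05],  # step 2
]
without_template = [
    [0.25, 0.25, 0.25, 0.25],  # step 1: uniform, maximum entropy
    [0.40, 0.30, 0.20, 0.10],  # step 2
]

for step, (h_t, h_s) in enumerate(
    zip(entropy_per_step(with_template), entropy_per_step(without_template)),
    start=1,
):
    print(f"step {step}: template={h_t:.3f} nats, simple={h_s:.3f} nats")
```

With a real model, each inner list would be the softmax of the logits at that step; lower entropy at the early steps corresponds to the more constrained output distribution that the caption attributes to templated prompts.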