Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi

TL;DR

The paper identifies typicality bias in human preference data as a fundamental driver of mode collapse observed after post-training alignment. It formalizes the effect within a reward- and KL-regularized optimization framework and introduces Verbalized Sampling (VS), a training-free prompting strategy that elicits a set of responses with corresponding probabilities to recover the base model’s diversity. Across creative writing, dialogue simulation, open-ended QA, and synthetic data generation, VS significantly enhances output diversity while preserving factual accuracy and safety, with larger, more capable models benefiting more from the approach. This work offers a data-centric lens on alignment and provides a practical, inference-time remedy to unlock LLM creative potential and diversity without additional training.
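
The formalization mentioned above can be sketched with the standard KL-regularized alignment objective. This is a hedged illustration rather than the paper's exact statement: the symbols $\beta$ (KL weight), $\alpha$ (a hypothetical typicality weight), $r_{\text{true}}$, and $\pi_{\text{ref}}$ are assumed notation, and the additive form of the biased reward is an assumption made for the sketch.

```latex
\begin{aligned}
&\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
  && \text{(KL-regularized objective)} \\
&\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)\,
  \exp\!\big(r(x,y)/\beta\big)
  && \text{(closed-form optimum)} \\
&r(x,y) = r_{\text{true}}(x,y) + \alpha \log \pi_{\text{ref}}(y \mid x),
  \quad \alpha > 0
  && \text{(assumed typicality-biased reward)} \\
&\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta}\,
  \exp\!\big(r_{\text{true}}(x,y)/\beta\big)
  && \text{(sharpened reference distribution)}
\end{aligned}
```

Under these assumptions the exponent $1 + \alpha/\beta$ exceeds 1, so the optimum concentrates mass on responses the reference model already considers typical, which is one way to read mode collapse.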

Abstract

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
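
As a minimal sketch of how the prompting strategy described in the abstract might be wired up, the Python below builds a VS-style prompt, parses a reply, and samples one response by its verbalized probability. The prompt wording, the JSON reply format, and the function names `build_vs_prompt`, `parse_vs_response`, and `sample_response` are assumptions for illustration, not the authors' released prompt or code.

```python
import json
import random


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Wrap a task in a VS-style instruction (hypothetical wording)."""
    return (
        f"Generate {k} responses to the task below, "
        "each with its corresponding probability.\n"
        'Return a JSON list of objects with keys "text" and "probability".\n'
        f"Task: {task}"
    )


def parse_vs_response(raw: str) -> list[tuple[str, float]]:
    """Parse the JSON reply and renormalize the verbalized probabilities."""
    items = json.loads(raw)
    pairs = [(item["text"], float(item["probability"])) for item in items]
    total = sum(p for _, p in pairs)
    if total <= 0:
        raise ValueError("verbalized probabilities must sum to a positive value")
    # Models rarely emit probabilities that sum to exactly 1, so rescale.
    return [(text, p / total) for text, p in pairs]


def sample_response(pairs, rng=None):
    """Draw one response in proportion to its verbalized probability."""
    rng = rng or random.Random()
    texts = [text for text, _ in pairs]
    weights = [p for _, p in pairs]
    return rng.choices(texts, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(build_vs_prompt("Tell a joke about coffee"))
    # Stand-in for a model reply; a real call to an LLM API would go here.
    mock_reply = '[{"text": "joke A", "probability": 0.2}, {"text": "joke B", "probability": 0.6}]'
    print(sample_response(parse_vs_response(mock_reply)))
```

Renormalizing is a design choice in this sketch: it treats the verbalized numbers as relative weights, which keeps sampling well defined even when the model's stated probabilities do not sum to one.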

Paper Structure

This paper contains 126 sections, 20 equations, 30 figures, 32 tables.

Figures (30)

  • Figure 1: We show that typicality bias in preference data is a fundamental and pervasive cause of mode collapse, reducing output diversity. As a solution, we propose Verbalized Sampling (VS), a principled prompting method that returns distributions of responses, to improve diversity.
  • Figure 2: Ready-to-use Verbalized Sampling (VS) Prompt. See the appendix on experiment prompts for more variants and detail.
  • Figure 3: Qualitative and quantitative examples on different tasks. For story writing, VS improves the output diversity. For the donation dialogue simulation task, VS simulates a donation amount distribution much closer to the human distribution, and generates more realistic persuasion behaviors (e.g., resistance and changes of mind; see the table of example simulated dialogues). On the task of enumerative open-ended QA, we ask the model to "generate US states". We first query a pretraining corpus (RedPajama) to establish a "reference" distribution of US state names in the pretraining data. The verbalized probability distribution generated by VS, when averaged over 10 trials, closely aligns with this reference pretraining distribution (KL=0.12). In contrast, direct prompting collapses into a few modes, repeatedly outputting states like California and Texas. See the appendix on probing pre-training data for more detail.
  • Figure 3: Human-rated diversity (1 = Very Similar, 4 = Very Dissimilar) for poem, story, and joke tasks under Direct, Sequence, and VS-Standard.
  • Figure 4: a-c: Average semantic diversity scores (%) in poem (a), story (b) and joke (c) across methods and models. Our methods consistently outperform the baselines. We performed a one-tailed t-test between VS-Standard and the baselines (* $p<0.05$, ** $p<0.01$, *** $p<0.001$). d: Diversity vs. Quality trade-off for the poem task, where VS-Multi and VS-CoT approach the Pareto front. e-f: Emergent Trend where larger models benefit more from VS. We show differences in diversity (e) and quality (f) over Direct across small (GPT-4.1-Mini, Gemini-2.5-Flash) and large (GPT-4.1, Gemini-2.5-Pro) models. g-i: Tunable Diversity shows the diversity tuning results on Gemini-2.5-Flash across tasks. Unlike baseline methods in dashed lines, we can tune the diversity level with VS: as the probability threshold decreases, diversity increases.
  • ...and 25 more figures
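
The KL comparison reported in the Figure 3 caption (verbalized distributions averaged over trials, then compared with a reference distribution) can be sketched as follows. This is an illustration under stated assumptions: the direction of the divergence, the smoothing constant `eps`, and the helper names `average_distributions` and `kl_divergence` are not taken from the paper, and the numbers in the usage example are toy values.

```python
import math
from collections import defaultdict


def average_distributions(trials):
    """Average several {item: probability} dicts, treating missing items as 0."""
    totals = defaultdict(float)
    for dist in trials:
        for item, p in dist.items():
            totals[item] += p
    n = len(trials)
    return {item: s / n for item, s in totals.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union support, with additive smoothing (assumed)."""
    support = set(p) | set(q)
    p_s = {k: p.get(k, 0.0) + eps for k in support}
    q_s = {k: q.get(k, 0.0) + eps for k in support}
    z_p, z_q = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[k] / z_p) * math.log((p_s[k] / z_p) / (q_s[k] / z_q))
        for k in support
    )


if __name__ == "__main__":
    # Toy values only; not data from the paper.
    reference = {"California": 0.4, "Texas": 0.35, "Ohio": 0.25}
    trials = [
        {"California": 0.5, "Texas": 0.3, "Ohio": 0.2},
        {"California": 0.3, "Texas": 0.4, "Ohio": 0.3},
    ]
    print(kl_divergence(reference, average_distributions(trials)))
```

Smoothing over the union of both supports avoids division by zero when one distribution omits an item, at the cost of a small bias that shrinks as `eps` decreases.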

Theorems & Definitions (3)

  • proof
  • proof
  • proof