SimpleStrat: Diversifying Language Model Generation with Stratification

Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez

TL;DR

This work proposes SimpleStrat, an alternative to temperature scaling that uses the language model itself to partition the solution space into strata and achieves higher recall on ground truth solutions. It also introduces CoverageQA, a dataset of underspecified questions with multiple equally plausible answers.

Abstract

Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show that this approach not only produces lower-quality individual generations as temperature increases, but also depends on the model's next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample is drawn from within that stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring the KL divergence between the output distribution and the uniform distribution over valid ground truth answers. As computing the probability per response/solution for proprietary models is infeasible, we measure recall on ground truth solutions instead. Our evaluation shows that SimpleStrat achieves 0.05 higher recall compared to GPT-4o and a 0.36 average reduction in KL divergence compared to Llama 3.
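
The two diversity metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that responses have already been normalized to answer strings, that the output distribution is estimated empirically from resampled responses, and that the divergence is taken as KL(P || U) with U uniform over the valid answers; the abstract does not state the direction. The function names `recall` and `kl_from_uniform` are hypothetical.

```python
import math
from collections import Counter


def recall(samples, ground_truth):
    """Fraction of ground-truth answers that appear at least once in the samples."""
    truth = set(ground_truth)
    return len(truth & set(samples)) / len(truth)


def kl_from_uniform(samples, ground_truth):
    """Empirical KL(P || U), where U is uniform over the valid answers.

    Assumption: samples outside the ground-truth set are ignored, and P is
    renormalized over the valid answers that remain.
    """
    truth = set(ground_truth)
    counts = Counter(s for s in samples if s in truth)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sample matches a ground-truth answer")
    kl = 0.0
    for answer in truth:
        p = counts[answer] / total
        if p > 0:  # terms with p = 0 contribute nothing to KL(P || U)
            kl += p * math.log(p * len(truth))
    return kl


lakes = ["Superior", "Michigan", "Huron", "Erie", "Ontario"]
collapsed = ["Superior"] * 50 + ["Michigan"] * 50  # only two answers ever appear
balanced = lakes * 20                              # every answer equally often

print(recall(collapsed, lakes), kl_from_uniform(collapsed, lakes))
print(recall(balanced, lakes), kl_from_uniform(balanced, lakes))
```

A perfectly uniform sampler scores a recall of 1 and a divergence of 0, while a sampler that collapses onto a few answers scores lower recall and a larger divergence.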

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Stratified Sampling vs Temperature Scaling. Consider the LLM user request "Name a US State." SimpleStrat employs auto-stratification to utilize the LLM to identify good dimensions of diversity, for instance "East/West of the Mississippi River." Then, SimpleStrat uses stratified sampling to diversify LLM generations.
  • Figure 2: SimpleStrat workflow. SimpleStrat employs 3 phases: 1) auto-stratification to identify good dimensions of diversity that divide the solution space into equal partitions, 2) heuristic estimation to estimate the proportion of solutions in each stratum, and 3) probabilistic prompting where a concrete prompt is randomly sampled from the prompt distribution specified by the previous two phases. Critically, diverse resampling comes from both the random choice of prompt as well as the temperature of the LLM decoding.
  • Figure 3: Diversity scaled with temperature. We show 100 resamples of "Name one Great Lake in the United States." On the right, we show the result of resampling GPT-4o 100 times per temperature. In contrast to SimpleStrat on the left, GPT-4o at temperature 1.5 still only samples Lake Huron once and never samples Lake Ontario. SimpleStrat improves the diversity across all temperatures.
  • Figure 4: Diversity measured with recall scaled with temperature. The figure shows the improved recall on CoverageQA compared to GPT-4o and Claude 3.5. Recall indicates the percentage of ground truth solutions observed after sampling 100 times. The benefit of SimpleStrat is especially pronounced at low temperatures, but the benefit is evident across all temperatures.
  • Figure 5: KL divergence from uniform for Baseline vs SimpleStrat on CoverageQA Wikipedia. Lower divergence indicates closer alignment with the desired uniform distribution; the arrow indicates the direction of maximum improvement from the baseline.
  • ...and 4 more figures
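
The probabilistic prompting phase described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the strata and proportions in `STRATA` are hard-coded placeholders standing in for what the LLM would produce during auto-stratification and heuristic estimation, dimensions are sampled independently, and the prompt template and the name `sample_prompt` are hypothetical. The actual LLM call is omitted.

```python
import random

# Placeholder output of phases 1 and 2 for the request "Name a US State."
# Keys are dimensions of diversity; values map each stratum to an estimated
# proportion of solutions. The 0.5 values are illustrative, not measured.
STRATA = {
    "location relative to the Mississippi River": {
        "east of the river": 0.5,
        "west of the river": 0.5,
    },
}


def sample_prompt(request, strata, rng):
    """Phase 3: draw one stratum per dimension and build a concrete prompt."""
    constraints = []
    for dimension, options in strata.items():
        labels = list(options)
        weights = [options[label] for label in labels]
        choice = rng.choices(labels, weights=weights, k=1)[0]
        constraints.append(f"{dimension}: {choice}")
    return f"{request} Constraints: " + "; ".join(constraints)


rng = random.Random(0)  # seeded only to make the sketch reproducible
prompts = [sample_prompt("Name a US State.", STRATA, rng) for _ in range(200)]
print(sorted(set(prompts)))
```

Each sampled prompt would then be sent to the LLM at the chosen decoding temperature, so that, as the caption notes, diversity comes from both the random choice of prompt and the decoding itself.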