Evaluating the Diversity and Quality of LLM Generated Content

Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani

TL;DR

This work identifies the gap in measuring diversity that is meaningful for open-ended LLM outputs by defining effective semantic diversity, which accounts for outputs meeting a quality threshold. It formalizes the framework with a validity function $V$ and a semantic map $S$, and introduces two robust diversity measures, $Div_{fixed}$ and $Div_{pair}$, to compare outputs while controlling for sample size. Through large-scale experiments across Llama-2 and Llama-3.1 families with varying post-training strategies (SFT, DPO, PPO, GRPO) and prompt templates, it uncovers counterintuitive results: preference-tuning, especially RL, increases effective semantic diversity by producing more high-quality outputs despite reducing lexical and syntactic diversity, while larger models raise semantic diversity without diminishing form diversity. The findings imply practical guidance for open-ended generation and synthetic data tasks, showing that smaller models can be more parameter-efficient for generating unique content, and that semantic-focused diversity metrics are essential for evaluating and guiding post-training strategies. The proposed framework is broadly applicable across domains and can inform future developments in alignment and evaluation of LLMs for diverse, high-quality outputs.
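To make the roles of the validity function $V$ and the semantic map $S$ concrete, here is a minimal sketch. This section does not give the exact formulas for $Div_{fixed}$ and $Div_{pair}$, so the two functions below are assumed readings rather than the authors' definitions: `div_fixed` counts distinct semantic classes among valid outputs and divides by the sample size, and `div_pair` is the fraction of output pairs in which both outputs are valid and semantically different. The toy "programs" (arithmetic expressions in `x`), the test inputs, and all function names are hypothetical.

```python
from itertools import combinations

TEST_INPUTS = (0, 1, 2, 3)  # hypothetical probe inputs for the toy task


def run(expr):
    """Evaluate a toy expression on every test input; None if it fails."""
    try:
        return tuple(eval(expr, {"__builtins__": {}}, {"x": x}) for x in TEST_INPUTS)
    except Exception:
        return None


def is_valid(expr):
    """Assumed validity function V: the output runs without error."""
    return run(expr) is not None


def semantic_key(expr):
    """Assumed semantic map S: behaviour on the test inputs."""
    return run(expr)


def div_fixed(outputs, valid, key):
    """Distinct semantic classes among valid outputs, over the sample size."""
    if not outputs:
        return 0.0
    classes = {key(o) for o in outputs if valid(o)}
    return len(classes) / len(outputs)


def div_pair(outputs, valid, key):
    """Fraction of pairs that are both valid and semantically distinct."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    distinct = sum(
        1 for a, b in pairs if valid(a) and valid(b) and key(a) != key(b)
    )
    return distinct / len(pairs)


# Five sampled "generations": two doublings, two squarings, one syntax error.
samples = ["x + x", "2 * x", "x ** 2", "x +", "x * x"]
fixed_score = div_fixed(samples, is_valid, semantic_key)
pair_score = div_pair(samples, is_valid, semantic_key)
print(fixed_score, pair_score)
```

In this toy sample the five strings are lexically all different, yet only two distinct behaviours survive the validity check, so both scores come out at 0.4. Under these assumed definitions an invalid output lowers the score as much as a duplicate does, which is why a model that produces more valid outputs can score higher even when its surface forms vary less.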

Abstract

Recent work suggests that preference-tuning techniques--including Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO--reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds--which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models--especially those trained via RL--exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity--revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.

Paper Structure

This paper contains 17 sections, 17 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Diversity and Quality metrics when modulating the Temperature parameter across different models. We use the CodeBertScore (codebertscore2023) model for neural cosine diversity.
  • Figure 2: An example of an open-ended problem description from our dataset.
  • Figure 3: Effective semantic diversity scores for all 19 models evaluated in our experiments, grouped by model family. Each bar is color-coded according to the post-training method, as categorized in Table \ref{tab:model_taxonomy}.
  • Figure 4: Model efficiency: We plot the parameter efficiency of a model in generating unique examples vs. model size (log-scale).
  • Figure 5: Neural diversity metrics when modulating the Temperature parameter across different models. We report neural diversity metrics using the ICEScore (icescore) and cosine diversity of CodeLlama-7B-Instruct embeddings.
  • ...and 5 more figures