LLM Output Homogenization is Task Dependent

Shomik Jain, Jack Lanchantin, Maximilian Nickel, Karen Ullrich, Ashia Wilson, Jamelle Watson-Daniels

TL;DR

This work argues that LLM output homogenization is not universally good or bad but depends on the task. It introduces an eight-category task taxonomy, a task-anchored notion of functional diversity, and a sampling approach that increases diversity in task categories where homogenization is undesired while preserving homogenization where it is desired. Through experiments across multiple models and datasets, the authors show that task-aware evaluation and sampling can mitigate homogenization without harming response quality, challenging the assumed diversity-quality trade-off. The framework offers practical guidance for designing inference-time prompts and alignment strategies that respect task-specific diversity requirements. Overall, the paper argues that task-centric evaluation is essential for robust, safe, and useful LLM deployment.

Abstract

A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. In creative writing tasks, by contrast, we may expect variation in key narrative components (e.g., plot, genre, setting), beyond the vocabulary or embedding diversity produced by temperature sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprising eight task categories that each have distinct concepts of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.

Paper Structure

This paper contains 41 sections, 4 equations, 20 figures, 37 tables.

Figures (20)

  • Figure 1: Our task-anchored sampling technique for mitigating output homogenization. The first step is to classify each input prompt into a task category. Note that if a prompt falls outside of the taxonomy, our approach can generalize to new task categories, or the model may resume its default behavior. The second step is task-anchored sampling, where we clarify the concept of functional diversity in the instruction to generate "different" responses at inference time. The taxonomy is outlined in the task taxonomy section, and our task-anchored sampling technique is detailed in the section on promoting diversity.
  • Figure 2: Our task-anchored sampling increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. We plot the average number of functionally diverse responses generated by GPT-4o for each sampling strategy and task category (with standard error). For the first category (Well-Specified Objective), bars closer to $1$ reflect the preservation of output homogenization that is expected. For all other categories, bars closer to $5$ reflect maximum functional diversity.
  • Figure 3: With task-based metrics, diversity is improved with no significant drop in quality. We plot quality on the $x$-axis and diversity on the $y$-axis and compare the tradeoff under general metrics vs task-based metrics. In (a), there is a large tradeoff between vocabulary diversity (Definition A.1) and quality scores determined by a reward model. In (b), there is a negligible tradeoff between task-anchored functional diversity (Definition 3.1) and quality scores from LLM judges using task-based grading checklists. Note that the checklist-based quality difference between scores $4$ and $5$ is "good" vs "very good". Plots show the mean and standard error of all metrics averaged across all task categories except category A, which we exclude because it is the only category where output homogenization is desired.
  • Figure 4: Heatmaps showing recall for models' task classification (proportion of prompts classified by the model into each task category, conditioned on each ground-truth task category).
  • Figure 5: Number of functionally diverse responses generated by Claude-4-Sonnet (cf. Figure 2).
  • ...and 15 more figures
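
The two-step procedure described in the Figure 1 caption (classify the prompt, then add a task-specific instruction about what "different" means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category keys, the instruction wording, the function names, and the keyword-based `toy_classifier` are all hypothetical, and only two of the eight categories are shown. In practice the classifier would presumably be a model call.

```python
from typing import Callable, Dict, Optional

# Hypothetical instruction templates. The keys and wording are illustrative
# assumptions; they do not reproduce the paper's taxonomy or prompts.
DIVERSITY_INSTRUCTIONS: Dict[str, str] = {
    "well_specified_objective": (
        "Generate {n} responses that reach the same final answer "
        "but use different problem-solving strategies."
    ),
    "creative_writing": (
        "Generate {n} responses that differ in key narrative components "
        "such as plot, genre, and setting."
    ),
}


def build_task_anchored_prompt(
    prompt: str,
    classify: Callable[[str], Optional[str]],
    n: int = 5,
) -> str:
    """Step 1: classify the prompt. Step 2: append a task-specific instruction."""
    category = classify(prompt)
    if category is None or category not in DIVERSITY_INSTRUCTIONS:
        # Prompt falls outside the known categories: keep default behavior.
        return prompt
    instruction = DIVERSITY_INSTRUCTIONS[category].format(n=n)
    return f"{prompt}\n\n{instruction}"


def toy_classifier(prompt: str) -> Optional[str]:
    """Keyword stand-in for a real task classifier (assumption for the demo)."""
    lowered = prompt.lower()
    if "story" in lowered or "poem" in lowered:
        return "creative_writing"
    if any(ch.isdigit() for ch in lowered):
        return "well_specified_objective"
    return None


if __name__ == "__main__":
    print(build_task_anchored_prompt("Write a story about a fox.", toy_classifier))
```

Passing the classifier in as a callable keeps the sketch independent of any particular model API, and the fallback branch mirrors the caption's note that out-of-taxonomy prompts may simply receive the model's default behavior.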

Theorems & Definitions (5)

  • Definition 3.1: Task-Anchored Functional Diversity
  • Definition 3.2: Number of Functionally Unique Responses
  • Definition A.1: Vocabulary Diversity
  • Definition A.2: Embedding Diversity
  • Definition A.3: Compression Diversity
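
Definition 3.2 is listed here by name only, so the following is a hedged sketch of one way a "number of functionally unique responses" could be computed, not the paper's formal definition. It assumes a caller-supplied pairwise predicate `functionally_same` (hypothetical; it could be an LLM judge or a rule) and greedily keeps one representative per group of functionally equivalent responses. The greedy pass gives a well-defined count only if the predicate behaves like an equivalence relation.

```python
from typing import Callable, List, Sequence


def count_functionally_unique(
    responses: Sequence[str],
    functionally_same: Callable[[str, str], bool],
) -> int:
    """Count responses that differ functionally from every earlier representative."""
    representatives: List[str] = []
    for response in responses:
        # Keep the response only if it matches no representative seen so far.
        if not any(functionally_same(response, rep) for rep in representatives):
            representatives.append(response)
    return len(representatives)


def same_final_answer(a: str, b: str) -> bool:
    """Toy predicate for objective tasks: compare only the last token.

    This is an illustrative assumption, not the paper's equivalence criterion.
    """
    return a.split()[-1:] == b.split()[-1:]


if __name__ == "__main__":
    math_responses = [
        "Factor the quadratic to get 4",
        "Apply the quadratic formula to get 4",
        "Guess and check gives 5",
    ]
    print(count_functionally_unique(math_responses, same_final_answer))
```

With five sampled responses per prompt, a count computed this way ranges from 1 (fully homogeneous) to 5 (every response functionally distinct), which matches the scale described in the Figure 2 caption.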