Jointly Reinforcing Diversity and Quality in Language Model Generations

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang

TL;DR

DARLING tackles diversity collapse in post-trained LMs by introducing a semantic-diversity signal, computed with a learned classifier, and combining it multiplicatively with a quality reward in online reinforcement learning. It partitions a group of responses into semantic equivalence classes, computes a per-response diversity score Div_d, and defines the reward r_darling = r ⋅ Norm(Div_d), which is optimized with a GRPO-based objective that uses token-level updates and no standard-deviation normalization. The approach generalizes to non-verifiable tasks (instruction following, creative writing) and verifiable tasks (competition math), improving quality and diversity at the same time and promoting exploration that improves overall performance. Across multiple model families and sizes, DARLING outperforms quality-only and lexical-diversity baselines, with higher pass@1 and pass@k and stronger creativity metrics. Ablations confirm the benefits of semantic diversity and multiplicative fusion.
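The reward fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the cluster labels are assumed to come from the learned classifier, Norm is assumed to be division by n - 1 (so the score lies in [0, 1]), and quality rewards are assumed to be non-negative so that the product behaves sensibly.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def darling_rewards(quality: Sequence[float],
                    cluster_ids: Sequence[Hashable]) -> List[float]:
    """Multiply each quality reward by a normalized diversity score.

    Assumption: diversity of a response is the number of group members
    outside its semantic class, divided by n - 1.
    """
    n = len(quality)
    if n != len(cluster_ids):
        raise ValueError("quality and cluster_ids must have the same length")
    if n < 2:
        raise ValueError("need at least two responses in a group")
    sizes = Counter(cluster_ids)
    rewards = []
    for r, c in zip(quality, cluster_ids):
        distinct = n - sizes[c]        # responses in other classes
        diversity = distinct / (n - 1)  # assumed form of Norm(Div_d)
        rewards.append(r * diversity)   # multiplicative fusion
    return rewards


def mean_centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-relative advantage without standard-deviation normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    quality = [1.0, 1.0, 0.5, 0.8]
    clusters = ["a", "a", "b", "c"]  # first two responses share a meaning
    fused = darling_rewards(quality, clusters)
    print(fused)
    print(mean_centered_advantages(fused))
```

In this toy group the two near-duplicate responses have their reward scaled by 2/3, while the unique responses keep their full quality reward. In a token-level update, every token of a response would share that response's advantage.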

Abstract

Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
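For readers unfamiliar with pass@k, the sketch below shows one widely used unbiased estimator: with n sampled solutions of which c are correct, pass@k = 1 - C(n - c, k) / C(n, k). Whether the paper uses exactly this estimator is not stated in this summary, so treat it as an assumption; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generations with c correct ones, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 is read as solution quality, while larger k rewards a model whose samples cover more distinct successful attempts.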

Paper Structure

This paper contains 30 sections, 15 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Diversity-Aware Reinforcement Learning (Darling): We first partition LLM generations into semantically equivalent clusters (represented by colors). While standard GRPO [deepseek-math] increases probabilities based on response quality only, Darling amplifies the increase in probability of diverse and high-quality responses.
  • Figure 2: Example of partitioning a group of responses into semantically equivalent subgroups and evaluating diversity. Diversity is calculated as the normalized count of responses that are distinct from a given response.
  • Figure 3: The quality-diversity tradeoff when using different sampling temperatures ($T$) for models (at 8B and 70B scales) trained with standard GRPO and Darling. $X$-axis: Distinct metric in NoveltyBench; $Y$-axis: Reward score used in NoveltyBench measuring quality of responses. Darling (blue) simultaneously achieves better quality ($Y$-axis) and diversity ($X$-axis), as demonstrated by the improved Pareto fronts at both the 8B and 70B scales.
  • Figure 4: Detailed win rates of the top-3 and the bottom-3 rubrics of Llama-3.1-8B-Instruct trained with Darling against models with similar ELO points. Darling's strength lies in being "Interesting and Original" and "Avoids Cliche" due to being able to generate creative responses.
  • Figure 5: Example generations of Llama-3.3-70B-Instruct before and after Darling training. We sample 4 parallel generations with temperature=1.0 for both models. Models trained with Darling exhibit better diversity.
  • ...and 6 more figures
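The partitioning step in Figure 2 can be sketched as follows, assuming a pairwise equivalence judgment is available. In the paper this judgment comes from a learned classifier; here `equivalent` is a hypothetical callable and the toy word-set predicate is only a stand-in. Merging pairs transitively with union-find is also an assumption about how pairwise decisions become classes.

```python
from typing import Callable, List, Sequence


def partition(responses: Sequence[str],
              equivalent: Callable[[str, str], bool]) -> List[int]:
    """Label each response with the smallest index in its class."""
    parent = list(range(len(responses)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            if equivalent(responses[i], responses[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller root so labels are deterministic.
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(responses))]


def same_word_set(a: str, b: str) -> bool:
    """Toy stand-in for the learned semantic classifier."""
    return set(a.lower().split()) == set(b.lower().split())


if __name__ == "__main__":
    group = ["a red fox", "Fox red a", "a blue whale", "the sea"]
    print(partition(group, same_word_set))
```

The resulting labels can be counted per class to obtain, for each response, the number of group members that are distinct from it.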