Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostafa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

TL;DR

This work reveals that data diversity in gradient space is a strong predictor of out-of-distribution generalization for LLM reasoning. It introduces G-Vendi, a scalable gradient-based diversity metric that correlates with OOD performance and outperforms traditional diversity proxies. Building on this, Prismatic Synthesis generates diverse, gradient-space-aware synthetic data via clustering and selective sampling, achieving state-of-the-art results on math reasoning and NLI benchmarks with substantially smaller data generators. The findings advocate for principled data diversification as a powerful lever for generalization, alongside careful quality control and scalability considerations.

Abstract

Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution tests but across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on data generators 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by the 671B R1 -- on 6 out of 7 challenging benchmarks.
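
To make "entropy of model-induced gradients" concrete, the sketch below computes a Vendi-style score over a matrix of per-sample gradient features. It assumes that G-Vendi follows the standard Vendi Score form (the exponential of the Shannon entropy of the eigenvalues of a trace-normalized similarity matrix); the abstract does not spell out the exact formula, so this is an illustration rather than the authors' implementation. The function name `vendi_score` and the input `features` (one row per training sample, e.g. a gradient vector from a proxy model) are hypothetical.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Vendi-style diversity of the rows of `features`.

    Assumption: each row is a per-sample gradient feature vector that has
    already been extracted from a proxy model (not done here).
    """
    # L2-normalize each row so every sample contributes equally.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    # The cosine-similarity Gram matrix divided by n has unit trace,
    # so its eigenvalues are non-negative and sum to one.
    n = unit.shape[0]
    gram = unit @ unit.T / n
    eigvals = np.linalg.eigvalsh(gram)

    # Drop numerically zero eigenvalues (0 * log 0 is treated as 0).
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))

    # exp(entropy) can be read as an "effective number" of distinct directions.
    return float(np.exp(entropy))
```

Under this definition, a dataset whose gradient vectors all point in the same direction scores close to 1, while n mutually orthogonal gradient vectors score close to n, which matches the intuition that higher values mean the data pushes the model in more distinct directions.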

Paper Structure

This paper contains 43 sections, 6 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: (Left) Overview of Prismatic Synthesis. We iteratively (1) cluster samples in gradient space, (2) generate new samples, and (3) add to the pool only the samples that fall in sparse clusters, consistently improving both the diversity and the scale of the generated dataset. (Right) Naive scaling of synthetic math data -- with no diversification or with heuristic persona-guided prompting [1billion-personas] -- saturates early when measured by average performance across 7 distinct benchmarks. Prismatic Synthesis consistently improves model performance beyond 100K samples, up to million-scale synthetic data. See § \ref{sec:synthetic_data_scaling} for more details.
  • Figure 2: G-Vendi and model OOD performance. G-Vendi shows a strong log-linear relationship with model performance when controlling for data scale and quality. In both tasks, models trained on datasets with high G-Vendi tend to generalize better on OOD benchmarks. Plots for baseline measures are shown in § \ref{app:results_on_baseline_measures}.
  • Figure 3: (Left) G-Vendi and in-distribution performance. Compared to OOD, ID performance is more heavily dominated by the scale of the training dataset -- e.g., 10K datasets with high diversity are less likely to outperform 50K datasets than in the OOD results in Fig. \ref{fig:diversity_evaluation_g-vendi}. However, very low diversity can still harm in-distribution performance. (Right) Ablation on the student model. Higher G-Vendi correlates with stronger OOD performance across model families and scales.
  • Figure 3: G-Vendi is stable with proxy models of different sizes and model families. We report rank correlation with the original proxy model and with model OOD performance on math reasoning.
  • Figure 4: Relationship between baseline diversity measures and model OOD performance, measured on math reasoning tasks. Relative OOD accuracy is averaged across 7 benchmarks, following the same process as in § \ref{sec:evaluating_data_diverstiy_measures}. Overall, widely used diversity measures lag behind G-Vendi in their correlation with model performance.
  • ...and 2 more figures
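
The three steps in the Figure 1 caption (cluster, generate, keep only samples in sparse clusters) can be sketched as a single selection round. This is a minimal illustration under stated assumptions, not the paper's algorithm: the caption does not specify the clustering method or how "sparse" is defined, so the sketch uses a plain k-means with deterministic farthest-point initialization and a hypothetical `max_cluster_size` threshold. The inputs `pool` and `candidates` stand in for gradient features of existing and newly generated samples, and all function names are illustrative.

```python
import numpy as np


def nearest_centroid(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Index of the closest centroid (squared Euclidean distance) per point."""
    diffs = points[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def farthest_point_init(points: np.ndarray, k: int) -> np.ndarray:
    """Deterministic init: start at points[0], then repeatedly add the point
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    for _ in range(k - 1):
        dists = np.min(
            [((points - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        centroids.append(points[dists.argmax()])
    return np.array(centroids, dtype=float)


def kmeans(points: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    """Plain Lloyd iterations; returns the k centroids."""
    centroids = farthest_point_init(points, k)
    for _ in range(iters):
        labels = nearest_centroid(points, centroids)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids


def select_sparse(pool, candidates, k, max_cluster_size):
    """Boolean mask over `candidates`: True where the candidate falls in a
    cluster that currently holds fewer than `max_cluster_size` pool samples.

    Assumption: "sparse" means "below a fixed size threshold".
    """
    centroids = kmeans(pool, k)                        # (1) cluster the pool
    pool_counts = np.bincount(nearest_centroid(pool, centroids), minlength=k)
    cand_labels = nearest_centroid(candidates, centroids)  # (2) place new samples
    return pool_counts[cand_labels] < max_cluster_size     # (3) keep sparse ones
```

In an iterative setting, the candidates selected by the mask would be appended to the pool and the round repeated with freshly generated samples, so that underrepresented regions fill in over time. A fuller version would also cap how many candidates a single cluster may accept within one round; that is omitted here for brevity.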