Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung; Seungju Han; Ximing Lu; Skyler Hallinan; David Acuna; Shrimai Prabhumoye; Mostafa Patwary; Mohammad Shoeybi; Bryan Catanzaro; Yejin Choi

TL;DR

This work reveals that data diversity in gradient space is a strong predictor of out-of-distribution generalization for LLM reasoning. It introduces G-Vendi, a scalable gradient-based diversity metric that correlates with OOD performance and outperforms traditional diversity proxies. Building on this, Prismatic Synthesis generates diverse, gradient-space-aware synthetic data via clustering and selective sampling, achieving state-of-the-art results on math reasoning and NLI benchmarks with substantially smaller data generators. The findings advocate for principled data diversification as a powerful lever for generalization, alongside careful quality control and scalability considerations.
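The "clustering and selective sampling" step can be illustrated with a small sketch. It assumes each sample has already been assigned a cluster id in gradient space (for example by k-means over gradient vectors, which is an assumption here and not a detail given in this summary). The function name `select_underrepresented` and the median-size threshold are hypothetical choices for illustration, not the authors' procedure.

```python
from collections import Counter
from statistics import median


def select_underrepresented(pool_clusters, candidate_clusters, num_clusters):
    """Return indices of candidates that land in sparse clusters.

    pool_clusters: cluster id of each sample already in the dataset.
    candidate_clusters: cluster id of each newly generated sample.
    num_clusters: total number of clusters (ids are 0..num_clusters-1).
    The median-size cutoff is an illustrative assumption.
    """
    counts = Counter(pool_clusters)
    # Include empty clusters so they count as size 0.
    sizes = [counts.get(c, 0) for c in range(num_clusters)]
    threshold = median(sizes)
    # Keep only candidates whose cluster is smaller than the median cluster.
    return [
        i for i, c in enumerate(candidate_clusters)
        if counts.get(c, 0) < threshold
    ]


pool = [0, 0, 0, 0, 1, 1, 2]   # cluster 0 is crowded, cluster 2 is sparse
candidates = [0, 1, 2, 2]      # cluster ids of new synthetic samples
kept = select_underrepresented(pool, candidates, num_clusters=3)
```

Here only the two candidates assigned to the sparse cluster 2 are kept; repeating this filter over several generation rounds would gradually flatten the cluster-size distribution.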

Abstract

Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models -- and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning -- as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman's $\rho \approx 0.9$) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data -- not just on in-distribution tests but also across unseen, out-of-distribution benchmarks -- significantly outperforming state-of-the-art models that rely on a data generator 20 times larger than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B -- the same base model trained on proprietary data generated by the 671B R1 -- on 6 out of 7 challenging benchmarks.
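To make "diversity via the entropy of model-induced gradients" concrete, the following is a minimal sketch of a Vendi-style score: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel. It assumes per-sample gradient vectors have already been extracted from a proxy model and stacked into a matrix; the cosine kernel, the function name `vendi_score`, and the eigenvalue cutoff are illustrative assumptions, and the paper's exact G-Vendi formulation may differ.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Vendi-style diversity of row vectors (e.g. per-sample gradients).

    features: array of shape (n_samples, dim). How the gradients are
    obtained and projected is outside this sketch (an assumption).
    """
    # L2-normalize rows so the kernel below is cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    x = features / np.clip(norms, 1e-12, None)
    n = x.shape[0]
    # Dividing by n makes the eigenvalues sum to 1, like a distribution.
    kernel = (x @ x.T) / n
    eigvals = np.linalg.eigvalsh(kernel)
    # Drop numerically zero eigenvalues, which contribute nothing to entropy.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


identical = np.ones((4, 3))   # four copies of the same direction
orthogonal = np.eye(4)        # four mutually orthogonal directions
low = vendi_score(identical)
high = vendi_score(orthogonal)
```

The score can be read as an effective number of distinct directions: it stays near 1 when all gradients point the same way and approaches the sample count when they are mutually orthogonal.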
