Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang, Lecheng Zheng, Kevin Choi, Balaji Veeramani, Edward Bowen, Alison Hu, Tyler Cody, Dawei Zhou

TL;DR

This work investigates how mixing real and synthetic data affects large language model scaling, revealing a three-phase learning pattern with two breakpoints that separate head and tail knowledge acquisition. It derives a neural tangent kernel–based generalization bound for real–synthetic mixtures and leverages it to build a scalable data-valuation framework, implemented via MK-MMD for distribution discrepancy and an initialization NTK term, all retraining-free for efficiency. Empirical results across image classification, sentiment analysis, instruction following, and complex reasoning confirm the proposed theory and show that the valuation method achieves higher correlation with ground-truth data contributions while requiring far less computation than retraining-based baselines. The approach thus offers a principled, scalable way to guide data selection in large-scale real–synthetic regimes, with practical implications for improved generalization and training efficiency.
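As a rough illustration of the distribution-discrepancy term mentioned above, the sketch below computes a multi-kernel MMD between two feature sets. It is not the paper's implementation: the function names, the Gaussian kernel family, the bandwidth grid, the uniform kernel weights, and the biased (V-statistic) estimator are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, bandwidth: float) -> np.ndarray:
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dist / (2.0 * bandwidth**2))

def mk_mmd2(x, y, bandwidths=(0.5, 1.0, 2.0, 4.0), weights=None) -> float:
    """Biased estimate of squared MMD under a weighted sum of Gaussian kernels.

    Assumption: uniform weights over a hypothetical bandwidth grid.
    """
    if weights is None:
        weights = np.full(len(bandwidths), 1.0 / len(bandwidths))
    total = 0.0
    for w, bw in zip(weights, bandwidths):
        k_xx = rbf_kernel(x, x, bw).mean()
        k_yy = rbf_kernel(y, y, bw).mean()
        k_xy = rbf_kernel(x, y, bw).mean()
        total += w * (k_xx + k_yy - 2.0 * k_xy)
    return float(total)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 5))               # stand-in for real features
    synth_close = rng.normal(size=(200, 5))        # same distribution
    synth_far = rng.normal(loc=1.0, size=(200, 5)) # shifted distribution
    print("close:", mk_mmd2(real, synth_close))
    print("far:  ", mk_mmd2(real, synth_far))
```

In this toy setup the shifted sample should yield a noticeably larger discrepancy than the sample drawn from the same distribution. Nothing is retrained to obtain the score.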

Abstract

The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, in particular underrepresenting long-tail knowledge because of truncation effects from data generation mechanisms such as top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior between learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective and efficient data valuation method that scales to large datasets. Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, and complex reasoning) demonstrate that our method surpasses state-of-the-art baselines in data valuation at significantly lower computational cost.
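The truncation effect described in the abstract can be seen in a toy setting. The sketch below applies top-p (nucleus) truncation to a power-law distribution over knowledge items and reports how many tail items lose all probability mass. The function names, the number of items, the exponent, and the threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def power_law(n_items: int, beta: float) -> np.ndarray:
    """Normalized power-law distribution p_i proportional to i^(-beta)."""
    ranks = np.arange(1, n_items + 1, dtype=float)
    weights = ranks ** (-beta)
    return weights / weights.sum()

def top_p_truncate(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of most likely items whose total mass
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(-probs, kind="stable")
    cumulative = np.cumsum(probs[order])
    # First position where the cumulative mass reaches top_p.
    n_keep = int(np.searchsorted(cumulative, top_p, side="left")) + 1
    n_keep = min(n_keep, probs.size)
    truncated = np.zeros_like(probs)
    kept = order[:n_keep]
    truncated[kept] = probs[kept]
    return truncated / truncated.sum()

if __name__ == "__main__":
    real = power_law(n_items=1000, beta=1.5)      # hypothetical real distribution
    synthetic = top_p_truncate(real, top_p=0.9)   # hypothetical generator output
    dropped = int(np.sum(synthetic == 0.0))
    print(f"items with zero synthetic mass: {dropped} of {real.size}")
```

Under these assumptions, most of the low-frequency items receive zero probability in the truncated distribution, which is the sense in which synthetic data underrepresents the tail.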

Paper Structure

This paper contains 19 sections, 5 theorems, 28 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

Consider training data where the probability of knowledge $i$ is $q_i=\pi p_i + (1-\pi)p^{\prime}_i$, where $p_i\propto i^{-\beta}$ and $p^{\prime}_i$ is cut off at rank $k$ as defined above. The test error $\mathcal{L}_{\text{test}}$ exhibits distinct scaling regimes characterized by two breakpoints. Phase 2 (Plateau): $c_1 k^{\beta} < |\bm{S}| < c_2 k^{\beta}/\pi$, where $c_2$ is an absolute constant.
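To make the quantities in the lemma concrete, the sketch below builds the mixture $q_i$ and evaluates the plateau interval for given $k$, $\beta$, and $\pi$. Several things here are assumptions for illustration only: the truncated distribution $p^{\prime}_i$ is taken to be the same power law restricted to ranks $i \le k$ and renormalized (the source's definition is not reproduced in this excerpt), the constants default to $c_1 = c_2 = 1$, and the function names are hypothetical.

```python
import numpy as np

def mixture_distribution(n_items: int, beta: float, k: int, pi: float) -> np.ndarray:
    """q_i = pi * p_i + (1 - pi) * p'_i for ranks i = 1..n_items.

    Assumption: p' is the power law truncated at rank k and renormalized.
    """
    ranks = np.arange(1, n_items + 1, dtype=float)
    p_real = ranks ** (-beta)
    p_real /= p_real.sum()
    p_synth = np.where(ranks <= k, ranks ** (-beta), 0.0)
    p_synth /= p_synth.sum()
    return pi * p_real + (1.0 - pi) * p_synth

def plateau_range(k: int, beta: float, pi: float, c1: float = 1.0, c2: float = 1.0):
    """Lower and upper ends of the plateau: c1 * k^beta and c2 * k^beta / pi."""
    lower = c1 * float(k) ** beta
    upper = c2 * float(k) ** beta / pi
    return lower, upper

if __name__ == "__main__":
    q = mixture_distribution(n_items=1000, beta=1.5, k=50, pi=0.1)
    lower, upper = plateau_range(k=50, beta=1.5, pi=0.1)
    print(f"mass beyond rank k: {q[50:].sum():.4f}")
    print(f"plateau for training sizes between {lower:.0f} and {upper:.0f}")
```

Because items beyond rank $k$ are supplied only by the real fraction $\pi$, a smaller $\pi$ pushes the upper breakpoint $k^{\beta}/\pi$ further out and widens the plateau.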

Figures (9)

  • Figure 1: Real-world knowledge follows a long-tail distribution (illustrated with the greatest common divisor task; Charton, 2023). Synthetic data is often sampled only from the head knowledge, leading to a truncated tail.
  • Figure 2: Fine-grained three-phase scaling behavior on real and synthetic mixtures, illustrated with the greatest common divisor task (Charton, 2023).
  • Figure 3: Three-phase scaling behavior with two breakpoints on real–synthetic mixtures, for the same greatest common divisor task as Figures 1 and 2.
  • Figure 4: Model accuracy as the training size $|\bm{S}|$ increases on CIFAR-100, under a long-tail class distribution. Dashed grey lines mark the predicted transition breakpoints at $|\bm{S}| = k^{\beta}$ (left) and $|\bm{S}| = k^{\beta}/\pi$ (right).
  • Figure 5: Test loss as the training size $|\bm{S}|$ increases on CIFAR-100, under a long-tail class distribution. Dashed grey lines mark the predicted transition breakpoints at $|\bm{S}| = k^{\beta}$ (left) and $|\bm{S}| = k^{\beta}/\pi$ (right).
  • ...and 4 more figures

Theorems & Definitions (8)

  • Lemma 1: Scaling Behavior with Three Phases
  • Theorem 1: LLM Generalization Bound under Real and Synthetic Mixtures
  • Definition 1
  • Lemma 2
  • Theorem 1: LLM Generalization Bound under Real and Synthetic Mixtures
  • proof
  • Lemma 2: Scaling Behavior with Three Phases
  • proof