Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures

Haohui Wang; Jingyuan Qi; Jianpeng Chen; Jun Wu; Lifu Huang; Lecheng Zheng; Kevin Choi; Balaji Veeramani; Edward Bowen; Alison Hu; Tyler Cody; Dawei Zhou

TL;DR

This work investigates how mixing real and synthetic data affects large language model scaling, revealing a three-phase learning pattern with two breakpoints that separate head and tail knowledge acquisition. It derives a neural tangent kernel (NTK) based generalization bound for real-synthetic mixtures and uses it to build a scalable data-valuation framework. The framework measures distribution discrepancy with MK-MMD and adds an NTK term computed at initialization, so it requires no retraining. Empirical results across image classification, sentiment analysis, instruction following, and complex reasoning support the theory and show that the valuation method correlates more strongly with ground-truth data contributions while requiring far less computation than retraining-based baselines. The approach thus offers a principled, scalable way to guide data selection in large-scale real-synthetic regimes, with practical implications for generalization and training efficiency.
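To make the distribution-discrepancy term concrete, the following is a minimal sketch of a multi-kernel MMD estimate between a real and a synthetic sample. It is not the paper's implementation: the Gaussian kernel family, the uniform kernel weights, the bandwidth values, the biased estimator, and all function names here are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Uniformly weighted average of Gaussian kernels (assumed weighting)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)


def mk_mmd_squared(real, synthetic, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD under the averaged kernel."""
    def mean_kernel(a, b):
        total = sum(multi_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))

    return (
        mean_kernel(real, real)
        + mean_kernel(synthetic, synthetic)
        - 2.0 * mean_kernel(real, synthetic)
    )


# Toy 2-D features standing in for model embeddings (hypothetical data).
rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
close = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
shifted = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(100)]

print("same distribution:   ", mk_mmd_squared(real, close))
print("shifted distribution:", mk_mmd_squared(real, shifted))
```

A synthetic sample drawn from a shifted distribution should yield a larger value than one drawn from the same distribution as the real sample, and a discrepancy term in a valuation score would be expected to penalize exactly that kind of gap.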

Abstract

The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, in particular underrepresenting long-tail knowledge because of truncation effects from data generation mechanisms such as top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior between learning head knowledge and learning tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective and efficient data valuation method that scales to large datasets. Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, and complex reasoning) demonstrate that our method surpasses state-of-the-art baselines in data valuation at significantly lower computational cost.
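The truncation effect attributed to top-p sampling can be illustrated with a small sketch. It assumes a Zipf-shaped "real" distribution over a hypothetical vocabulary of 1000 items and a nucleus threshold of 0.9; both choices, and the function names, are illustrative and not taken from the paper.

```python
def zipf_distribution(vocab_size, exponent=1.0):
    """Long-tailed distribution: probability proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_p_truncate(probs, top_p):
    """Keep the smallest high-probability set with mass >= top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / cumulative
    return truncated


real = zipf_distribution(1000)
synthetic = top_p_truncate(real, top_p=0.9)

# Items outside the nucleus can never be generated, however many samples are drawn.
dropped = sum(1 for p in synthetic if p == 0.0)
print("items with zero synthetic probability:", dropped)
print("head item, real vs synthetic:", real[0], synthetic[0])
```

Under these assumptions the tail items receive zero probability in the synthetic distribution while head items are slightly over-weighted, which is the kind of head-versus-tail mismatch the abstract describes.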
