Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

Zeyu Gan, Yong Liu

TL;DR

The paper develops a theoretical account of synthetic data in LLM post-training through a reverse-bottleneck framework that traces information flow from anchor data, through prompts and the generative model, to the synthetic data. It defines the information gain $\Delta I$ and the compression bottleneck $B_{\text{syn}}$, and derives an information-flow-based upper bound on the generalization error of training on synthetic data. A key contribution is the Generalization Gain via Mutual Information (GGMI), formalized as $\text{GGMI} = I(S_{\text{anchor}}, W') - I(S_{\text{gen}}, W)$, with bounds showing how a larger $\Delta I$ and appropriate entropy terms can improve generalization. The work also provides a GMM-based verification protocol (KL Gap) to illustrate the theory, and discusses the trade-off between faithfulness and diversity in synthetic data. Overall, the framework offers theoretical guidance for designing synthetic data generation pipelines and optimizing post-training for generalization, while acknowledging that the theory is hard to validate on real-world LLMs.

Abstract

Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open-source our code at https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.
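
To make the GGMI expression from the TL;DR concrete, the following is a minimal sketch that computes a difference of two mutual-information terms from small discrete joint probability tables. It is not the paper's code: the function name `mutual_information` and both toy tables are hypothetical, and real datasets and model weights are not low-dimensional discrete variables, so this only illustrates the arithmetic of $I(S_{\text{anchor}}, W') - I(S_{\text{gen}}, W)$.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X; Y) in nats from a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize defensively
    px = joint.sum(axis=1, keepdims=True)    # marginal of X (rows), shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y (columns), shape (1, m)
    mask = joint > 0                         # terms with p(x, y) = 0 contribute 0
    ratio = joint[mask] / (px @ py)[mask]    # p(x, y) / (p(x) p(y))
    return float(np.sum(joint[mask] * np.log(ratio)))

# Hypothetical toy joints (assumptions for illustration only):
# rows index a coarse "dataset" variable, columns a coarse "hypothesis" variable.
joint_anchor = np.array([[0.45, 0.05],
                         [0.05, 0.45]])      # stands in for (S_anchor, W')
joint_gen = np.array([[0.30, 0.20],
                      [0.20, 0.30]])         # stands in for (S_gen, W)

# GGMI-style difference of the two mutual-information terms.
ggmi = mutual_information(joint_anchor) - mutual_information(joint_gen)
print(f"I(anchor) - I(gen) = {ggmi:.4f} nats")
```

In this toy setting the difference is positive because the second table is closer to independence. Under the standard information-theoretic reading, lower mutual information between training data and the learned hypothesis corresponds to a tighter generalization bound.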

Paper Structure

This paper contains 36 sections, 9 theorems, 49 equations, 5 figures, 4 tables.

Key Result

Lemma 3.1

Assume that $\pi$ has a loss function $\ell$ bounded by $C$. Given an i.i.d. synthetic dataset $S_\text{gen}$ generated as defined above, the following upper bound on the generalization error of training on synthetic data holds:

Figures (5)

  • Figure 1: An overview of the synthetic data generation modeling and the relationships between the distributions. (a) The synthetic data generation process and the corresponding distribution compression process. (b) The relationships between the distributions in the generation process.
  • Figure 2: Simulation of the distribution relationships with GMMs. Blue "$\bullet$" markers represent the anchor data, sampled from the distributions colored blue, and orange "$\bullet$" markers represent the synthetic data, sampled from the distributions colored orange.
  • Figure 3: Illustration of the reverse-bottleneck effect and a comparison with the classic ML process. Left: the similarity between the forward process of synthetic data generation and classic ML. Right: the difference between the information flows of the two processes, where synthetic data generation gains information from $M$, constituting a reverse bottleneck.
  • Figure 4: KL Gap under different component settings. By default, we set $K=J=L=2$, and vary each of them from $2$ to $15$ to observe the corresponding change in KL Gap. KL Gap increases as $J$ increases, and decreases as $K$ and $L$ increase. The shading indicates the standard deviation over 100 rounds of random settings.
  • Figure 5: Illustration of the setup of the GMMs for simulation.
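
The GMM simulations in Figures 2, 4, and 5 rely on KL divergences between Gaussian mixtures, which have no closed form. The sketch below shows one common way to estimate such a divergence by Monte Carlo sampling. It is an assumption-laden illustration, not the paper's KL Gap protocol, whose exact definition is not given in this summary. The helper names (`gmm_logpdf`, `gmm_sample`, `kl_monte_carlo`) and all mixture parameters are hypothetical.

```python
import numpy as np

def gmm_logpdf(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture at each point in x."""
    x = np.asarray(x, dtype=float)[:, None]           # shape (n, 1)
    log_comp = (np.log(weights) - 0.5 * np.log(2 * np.pi) - np.log(stds)
                - 0.5 * ((x - means) / stds) ** 2)    # shape (n, k)
    m = log_comp.max(axis=1, keepdims=True)           # log-sum-exp for stability
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def gmm_sample(rng, n, weights, means, stds):
    """Draw n samples: pick a component per sample, then sample its Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

def kl_monte_carlo(rng, p, q, n=50_000):
    """Estimate D_KL(p || q) as the mean of log p(x) - log q(x) over x ~ p."""
    x = gmm_sample(rng, n, *p)
    return float(np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q)))

# Hypothetical two-component mixtures: (weights, means, stds).
rng = np.random.default_rng(0)
p = (np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([1.0, 1.0]))
q = (np.array([0.5, 0.5]), np.array([-1.0, 3.0]), np.array([1.0, 1.5]))

kl_pq = kl_monte_carlo(rng, p, q)
print(f"Estimated D_KL(p || q) = {kl_pq:.3f} nats")
```

A gap-style quantity could then be formed as the difference of two such estimates. Which pairs of distributions enter that difference is left open here, because this summary does not specify it.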

Theorems & Definitions (21)

  • Lemma 3.1
  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Lemma 4.4
  • Lemma 4.5
  • Lemma 4.6
  • Theorem 4.7
  • Lemma 4.8
  • Definition 4.9
  • ...and 11 more