Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective
Zeyu Gan, Yong Liu
TL;DR
The paper develops a theoretical account of synthetic data in LLM post-training by introducing a reverse-bottleneck framework that traces information flow from anchor data through prompts and generators to synthetic data. It defines the information gain $\boldsymbol{\triangle}I$ and the compression bottleneck $B_{\text{syn}}$, and derives an information-flow-based upper bound on the generalization error when training on synthetic data. A key contribution is the Generalization Gain via Mutual Information (GGMI), formalized as $\text{GGMI} = I(S_{\text{anchor}},W') - I(S_{\text{gen}},W)$, with bounds showing how a larger $\boldsymbol{\triangle}I$ and appropriate entropy terms can improve generalization. The work also provides a GMM-based verification protocol (KL Gap) to illustrate the theory and discusses the trade-off between faithfulness and diversity in synthetic data. Overall, the framework offers theoretical guidance for designing synthetic data generation pipelines and optimizing post-training for generalization performance, while acknowledging that validating the theory on real-world LLMs remains challenging.
Abstract
Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open-source our code at https://github.com/ZyGan1999/Towards-a-Theoretical-Understanding-of-Synthetic-Data-in-LLM-Post-Training.
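As a rough illustration of the kind of quantity behind the GMM-based verification mentioned in the TL;DR, the sketch below estimates the KL divergence between two one-dimensional Gaussian mixtures by Monte Carlo sampling, since KL between mixtures has no closed form in general. This is not the authors' protocol: how the paper combines such divergences into its KL Gap is not specified here, and the function names (`gmm_logpdf`, `gmm_sample`, `mc_kl`) and the example mixtures are hypothetical choices made only for this example.

```python
import math
import random


def gmm_logpdf(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture at x, via log-sum-exp for stability."""
    logs = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def gmm_sample(rng, weights, means, stds):
    """Draw one sample: pick a component by weight, then sample that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])


def mc_kl(p, q, n=20000, seed=0):
    """Monte Carlo estimate of KL(p || q) = E_{x~p}[log p(x) - log q(x)].

    p and q are (weights, means, stds) tuples; both are illustrative inputs,
    not quantities taken from the paper.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(rng, *p)
        total += gmm_logpdf(x, *p) - gmm_logpdf(x, *q)
    return total / n


# Hypothetical stand-ins: a bimodal "reference" mixture and a unimodal
# approximation that ignores the two modes.
reference = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
approximation = ([1.0], [0.0], [1.0])
kl_estimate = mc_kl(reference, approximation)
```

Under these assumptions, a larger estimate means the approximation captures the reference distribution less faithfully; comparing such estimates across models trained with and without synthetic data is one plausible way to read a "gap", though the paper's own definition should be consulted.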
