Escaping Collapse: The Strength of Weak Data for Large Language Model Training

Kareem Amin, Sara Babakniya, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii

TL;DR

The paper investigates how to prevent performance collapse when training large language models on synthetically generated data. It introduces a boosting-inspired data-generation framework that combines synthetic data with two exogenous signals, a $\gamma$-noisy filter and a $\beta$-weak labeler, and formalizes strong learning from this mixture of data to guarantee convergence to an optimal LLM. The main theoretical result shows that with positive $\alpha$ and $\beta$ and appropriate iteration parameters, the final model achieves near-perfect correctness on nearly all prompts, with a convergence rate tied to $\beta$ and $\gamma$. Empirical experiments on GSM8K and MBPP validate the theory and demonstrate that focusing labeling resources on the most challenging prompts yields robust improvements, offering practical guidance for data-curation strategies in self-improving LLM pipelines.

Abstract

Synthetically generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even "collapse", after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. The approach we analyze subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples -- in much the same way that boosting focuses the efforts of the weak learner -- leads to improved performance.

Paper Structure

This paper contains 45 sections, 7 theorems, 24 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 5

Let $\varepsilon \in (0, 1)$. Suppose that in Algorithm alg:boosting we have $\alpha > 0$, $\beta \in (0, 1)$, $\gamma \in (0, 1]$ and $k \ge \frac{2\log T + \log |P|}{\beta\gamma}$. With probability at least $1 - 1/T$ over the randomness of the algorithm, the final LLM $g_T$ output by the algorithm satisfies ... Note that by setting $\alpha = \varepsilon$ in Algorithm alg:boosting the iteration comp...
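To get a feel for the condition on $k$ in Theorem 5, the following minimal sketch evaluates the lower bound $(2\log T + \log |P|)/(\beta\gamma)$ for concrete values. The function name `min_k`, the use of the natural logarithm, and the sample parameter values are assumptions made for illustration; the theorem statement above does not specify the base of the logarithm or any particular settings.

```python
import math


def min_k(T: int, num_prompts: int, beta: float, gamma: float) -> int:
    """Smallest integer k with k >= (2 log T + log |P|) / (beta * gamma).

    Assumptions (not from the source): natural logarithm, and
    `num_prompts` stands for |P|, the size of the prompt set.
    """
    if T < 1 or num_prompts < 1:
        raise ValueError("T and num_prompts must be positive integers")
    if not (0.0 < beta < 1.0):
        raise ValueError("beta must lie in (0, 1)")
    if not (0.0 < gamma <= 1.0):
        raise ValueError("gamma must lie in (0, 1]")
    bound = (2.0 * math.log(T) + math.log(num_prompts)) / (beta * gamma)
    return math.ceil(bound)


if __name__ == "__main__":
    # Hypothetical settings: 10 rounds, 1000 prompts.
    for beta, gamma in [(0.1, 0.5), (0.05, 0.5), (0.1, 1.0)]:
        k = min_k(T=10, num_prompts=1000, beta=beta, gamma=gamma)
        print(f"beta={beta}, gamma={gamma}: k >= {k}")
```

Under these assumptions the bound grows only logarithmically in $T$ and $|P|$ but scales as $1/(\beta\gamma)$, so halving the labeler strength $\beta$ roughly doubles the required $k$.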

Figures (6)

  • Figure 1: We plot test and train performance of our Algorithm 2 variants on GSM8K, across rounds. We report the mean and sample standard deviation (np.std(*,ddof=1)) for 3 seeds. For train accuracy plots, we plot both train accuracy@1 (solid) and train accuracy@8 (stacked). Boosting results displayed here use weak data (A).
  • Figure 2: We plot test and train performance of our Algorithm 2 variants on MBPP, across rounds. We report the mean and sample standard deviation (np.std(*,ddof=1)) for 3 seeds. For train pass rate plots, we plot both train pass@1 (solid) and train pass@32 (stacked). Boosting results displayed here use weak data (A).
  • Figure 3: Labeler accuracy across rounds. These results use weak data (A). Since training accuracy increases across rounds, the weak labeler gets more queries per question in both cases. Despite this, for Boosting we see that accuracy is relatively constant for GSM8K and decreasing for MBPP. This is because we focus on increasingly harder problems. In Boosting w/o focusing, we observe labeler accuracy increasing because we do not focus labeler efforts on the highest difficulty problems.
  • Figure 4: Average length of responses to GSM8K test set problems across rounds for Boosting experiments.
  • Figure 5: We experiment with off-policy labelers on GSM8K, plotting performance across rounds. Boosting (on-policy) is the setting in all prior experiments, employing Gemma 2 2B PT as the labeler. We see improvement from using the stronger Gemma 7B as our labeler. The weaker Gemma 1 2B performs much worse, but approaches the results from Filter only after 5 rounds.
  • ...and 1 more figure
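The captions for Figures 1 and 2 report the mean and `np.std(*,ddof=1)` over 3 seeds. As a minimal sketch of that statistic, the snippet below uses the standard library's `statistics.stdev`, which divides by $n - 1$ and therefore matches `np.std(x, ddof=1)`. The function name `summarize_seeds` and the three accuracy values are hypothetical and are not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_seeds(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores.

    statistics.stdev uses the n - 1 denominator, i.e. the same
    estimator as np.std(scores, ddof=1). It needs at least 2 values.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    return mean, std


if __name__ == "__main__":
    # Hypothetical per-seed test accuracies for one round (3 seeds).
    per_seed_accuracy = [0.60, 0.62, 0.64]
    mean, std = summarize_seeds(per_seed_accuracy)
    print(f"mean={mean:.3f}, std(ddof=1)={std:.3f}")
```

With only 3 seeds the choice of denominator matters: the $n - 1$ estimator is noticeably larger than the population version (`ddof=0`), so error bars computed this way are the more conservative of the two.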

Theorems & Definitions (18)

  • Definition 1: Strong Learner
  • Definition 2: $\gamma$-noisy Filter
  • Definition 3: $\beta$-weak Labeler
  • Definition 4: Generation
  • Theorem 5
  • proof : Proof sketch
  • Lemma 6
  • proof
  • Lemma 7
  • proof
  • ...and 8 more