Data-efficient pre-training by scaling synthetic megadocs

Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang

Abstract

Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.

Paper Structure (59 sections, 16 figures, 1 table)

Figures (16)

  • Figure 1: Synthetic data by scaling generation count and utilizing megadocs. When training on 200M unique real tokens, the best 300M parameter model with tuned epoching and regularization achieves $3.55$ loss [kim2025pre]. By mixing synthetically generated rephrases into pre-training, we can improve i.i.d. loss on the original distribution (orange points), monotonically as the number of generations per document grows. We consider two synthetic data algorithms that leverage multiple generations to produce a single megadoc: stitched rephrasing, which concatenates all generations from the same document (blue points), and latent thoughts, which stretches documents by adding rationales (gray points). Both algorithms improve i.i.d. loss and improve scaling in the number of generations per document. The scaling in synthetic token count is even better as the average latent thought is shorter than the average rephrase.
  • Figure 2: Scaling synthetic generations. Left: Given 200M DCLM tokens, we tune a 300M parameter baseline utilizing epoching and regularization which achieves $3.55$ loss (purple point). We then measure the benefit of sampling $G$ rephrases per real doc and searching for locally optimal learning rate, weight decay, epoch count, and mixing fraction (orange points). We find that loss monotonically improves in the number of rephrases generated, appearing to plateau around $32$ generations at loss $3.41$. Right: The loss improvements and plateau are reflected on downstream benchmarks.
  • Figure 3: Synthetic data streams for megadocs. We visualize synthetic streams with 2 real docs and 3 generations. Simple rephrasing (Section \ref{sec:rephrasing-science}) permutes all generations and real docs. Stitched rephrasing concatenates all rephrases from the same real doc and either prepends or appends each real doc. Latent thoughts fixes $G$ split points and synthesizes a rationale to derive each suffix from its prefix. Notably, megadocs can exceed the length of the model's context. When training, masking is always disabled across document and megadocument boundaries.
  • Figure 4: Different ways to scale document length. We compare synthetic data algorithms (visualized in Figure \ref{fig:sorting_diagram}) given 8 generations per pre-training doc. We run 3 seeds of each algorithm. Both variants of stitched rephrasing outperform simple rephrasing, with real-last slightly outperforming real-first. Latent thoughts further improves both i.i.d. and arXiv loss.
  • Figure 5: Scaling generation count for stitching and latent thoughts. We find that stitched rephrasing (blue points) and latent thoughts (gray points) improve i.i.d. loss, long-context loss, and downstream benchmark accuracy over simple rephrasing (orange points). The improvements scale in the number of generations per document and show fewer signs of plateauing.
  • ...and 11 more figures
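
The three stream constructions described in the Figure 3 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the `"\n\n"` separator, the shuffling of megadocs, the evenly spaced split points, and the `rationale_fn` callback (a hypothetical stand-in for the model call that synthesizes a rationale) are all assumptions that the caption does not specify.

```python
import random
from typing import Callable, List, Sequence


def simple_rephrasing_stream(real_docs: Sequence[str],
                             rephrases: Sequence[Sequence[str]],
                             seed: int = 0) -> List[str]:
    """Treat every real doc and every rephrase as its own short document
    and permute them all together."""
    pool = list(real_docs) + [r for group in rephrases for r in group]
    random.Random(seed).shuffle(pool)
    return pool


def stitched_rephrasing_stream(real_docs: Sequence[str],
                               rephrases: Sequence[Sequence[str]],
                               real_last: bool = True,
                               sep: str = "\n\n",  # assumed separator
                               seed: int = 0) -> List[str]:
    """Build one megadoc per real doc by concatenating its rephrases and
    appending (real_last=True) or prepending the real doc."""
    megadocs = []
    for doc, group in zip(real_docs, rephrases):
        parts = list(group) + [doc] if real_last else [doc] + list(group)
        megadocs.append(sep.join(parts))
    # Assumption: megadocs are shuffled among themselves, never split apart.
    random.Random(seed).shuffle(megadocs)
    return megadocs


def latent_thoughts_megadoc(doc_tokens: Sequence,
                            num_generations: int,
                            rationale_fn: Callable[[Sequence, Sequence], list]) -> list:
    """Stretch one doc by inserting a rationale at each of G split points.
    rationale_fn(prefix, suffix) is a hypothetical generator call."""
    n = len(doc_tokens)
    # Assumption: split points are evenly spaced inside the document.
    splits = [n * (i + 1) // (num_generations + 1) for i in range(num_generations)]
    out, prev = [], 0
    for s in splits:
        out.extend(doc_tokens[prev:s])
        out.extend(rationale_fn(doc_tokens[:s], doc_tokens[s:]))
        prev = s
    out.extend(doc_tokens[prev:])
    return out
```

With 2 real docs and 3 generations each, as in the figure, `simple_rephrasing_stream` yields 8 short documents, while `stitched_rephrasing_stream` yields 2 megadocs. A stretched document from `latent_thoughts_megadoc` keeps the original tokens in order with the rationales interleaved, so its length grows with the number of generations and may exceed the model's context.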