BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

DatologyAI: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

TL;DR

The paper investigates the data bottleneck in trillion-token pretraining and compares generator-driven data creation with source rephrasing of web content. It introduces BeyondWeb, a rephrasing-centric framework that diversifies and grounds synthetic data to improve pretraining efficiency across 1B, 3B, and 8B models, achieving substantial accuracy gains and up to 7.7x training speedups. Through a rigorous, multi-faceted evaluation, the authors show that data quality, style alignment with deployment use-cases, and generation diversity jointly drive improvements, and that, even with modest rephraser sizes, high-quality synthetic data can surpass baselines built on larger models. The work establishes a new Pareto frontier for synthetic data, provides actionable insights into seed data selection, rephrasing strategies, and dataset diversity, and outlines future directions for scalable, accessible, and aligned synthetic data generation across domains.

Abstract

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.

Paper Structure

This paper contains 49 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Left: BeyondWeb establishes a new Pareto frontier for synthetic pretraining data. Notably, our 3B model outperforms all but one 8B model trained on baseline datasets with the same token budget. Average Accuracy (%) is the mean across 14 benchmarks. 1B model trained for 1T tokens; 3B and 8B models for 180B tokens. Right: For the 8B model, we achieve up to 7.7$\times$ and 2.7$\times$ speedup (in time to reach baseline accuracy) over RedPajama and Nemotron-Synth, respectively.
  • Figure 2: Knowledge transfer effectiveness across different synthetic data approaches. The yellow line represents Cosmopedia (46.8%), which uses a generator-driven, sophisticated educational content generation technique and an 8x7B model; the cyan line denotes the Summary (46.8%) approach, which uses an 8B model and a simple summarization prompt; the gray line denotes the RPJ-HQ (no synthetic data) baseline. These results demonstrate that even naive summarization achieves substantial improvements similar to those of Cosmopedia, suggesting distillation works through increased information density rather than complex knowledge transfer.
  • Figure 3: Illustration of data splitting and corpus construction strategies to enable a controlled setup. The figure shows how our 20 billion token dataset is divided and utilized across different experimental conditions. The top row displays three data segments: Original 1st Half (10B tokens of natural web content), Original 2nd Half (10B tokens of natural web content), and Continuation (10B tokens of synthetic content generated by extending documents from the first half). The example text snippets demonstrate how continuation generates stylistically consistent but novel content. The arrows below indicate corpus composition: Corpus 1 (Upper Bound) uses both original halves for full natural data coverage; Corpus 2 (2x Repeat) uses only the first half repeated twice; and Corpus 3 (Synthetic Extension) combines the second half with synthetic continuations. This experimental design isolates the effects of repetition versus synthetic augmentation when facing data constraints.
  • Figure 4: Performance comparisons across different data augmentation strategies during training. The dark blue line represents BeyondWeb (50.4%), which significantly surpasses all other approaches. The light blue line shows Continuation (46.2%), the cyan line depicts Full Data Upper Bound (46.2%), and the gray line represents 2x Repeat Lower Bound (45.5%). The striking visual separation emphasizes BeyondWeb's +4.2pp improvement over the Full Data upper bound. This reflects how intentionality is critical to breaking the data wall with synthetic data, and not just any synthetic data will yield benefits.
  • Figure 5: Performance comparison across different quality combinations in training data. HQ refers to high-quality web data, LQ refers to low-quality web data. The dark blue line shows BeyondWeb (50.4%), dark cyan shows HQ Synth + HQ Web (49.2%), where the synthetic data are rephrased versions of the HQ web samples, and the light cyan line shows LQ Synth + HQ Web (48.6%). The gray baseline corresponds to LQ Web + HQ Web (45.6%). These results indicate that improving the quality of input data for rephrasing improves the rephrased data, even when there is overlap with the original input data. But improved input data quality alone is inadequate for producing the highest quality synthetic data.
  • ...and 6 more figures
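
The three corpora described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_corpora`, the `continue_fn` callback (a stand-in for whatever model generates continuations), and the choice to split by document count rather than by token count are all assumptions made for the example. The composition of each corpus follows the wording of the caption.

```python
from typing import Callable, Dict, List


def build_corpora(
    docs: List[str],
    continue_fn: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Build the three controlled corpora described in the Figure 3 caption.

    Assumption: the split is done by document count; the caption describes
    a split into two 10B-token halves, which this toy version does not model.
    """
    mid = len(docs) // 2
    first_half, second_half = docs[:mid], docs[mid:]

    # Synthetic continuations are generated by extending first-half documents.
    # continue_fn is hypothetical: any callable mapping a document to new text.
    continuations = [continue_fn(doc) for doc in first_half]

    return {
        # Corpus 1 (Upper Bound): both natural halves.
        "upper_bound": first_half + second_half,
        # Corpus 2 (2x Repeat): the first half seen twice.
        "repeat_2x": first_half + first_half,
        # Corpus 3 (Synthetic Extension): second half plus continuations,
        # as stated in the caption.
        "synthetic_extension": second_half + continuations,
    }


if __name__ == "__main__":
    toy_docs = ["doc A", "doc B", "doc C", "doc D"]
    # Toy stand-in for a generator model.
    corpora = build_corpora(toy_docs, lambda d: d + " (continued)")
    for name, corpus in corpora.items():
        print(name, len(corpus))
```

All three corpora end up the same size, which is what lets the comparison isolate repetition from synthetic augmentation under a fixed data budget.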