Reformulation for Pretraining Data Augmentation

Xintong Hao, Ruijie Zhu, Ge Zhang, Ke Shen, Chenggang Li

TL;DR

The paper tackles the data repetition bottleneck in scaling large language models by introducing Massive Genre-Audience (MGA) reformulation, a lightweight data augmentation framework that reformulates existing text into diverse, context-rich variations. MGA generates the MGACorpus, a 770B-token dataset produced via a two-stage reformulation process guided by genre-audience pairs and quality controls based on Limited Consistency, enabling up to 3.9× token expansion with diverse presentations. Empirical results show MGA outperforms data repetition and upsampling across model sizes up to 13B parameters, with notable improvements on reasoning tasks; analyses reveal that prompt engineering and the diversity of reformulations influence performance, while standard validation losses may not fully reflect generalization or learning strategies. The work demonstrates MGA as a practical, scalable pathway to augment pretraining data, alleviating repetition bottlenecks and facilitating more efficient scaling of large language models, with implications for data-efficient training and synthetic data quality assessment.
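The two-stage process summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`mga_reformulate`, `expansion_ratio`), the callable interfaces, and the toy stand-ins are all hypothetical, and the real prompts, models, and Limited Consistency check are not specified in this summary. The default of five pairs per document follows the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (genre, audience); hypothetical representation


@dataclass
class Reformulation:
    genre: str
    audience: str
    text: str


def mga_reformulate(
    document: str,
    propose_pairs: Callable[[str, int], List[Pair]],
    rewrite: Callable[[str, str, str], str],
    is_consistent: Callable[[str, str], bool],
    n_pairs: int = 5,
) -> List[Reformulation]:
    """Stage 1 proposes (genre, audience) pairs; stage 2 rewrites and filters."""
    kept = []
    for genre, audience in propose_pairs(document, n_pairs):
        text = rewrite(document, genre, audience)
        # Assumed stand-in for the paper's quality control step.
        if is_consistent(document, text):
            kept.append(Reformulation(genre, audience, text))
    return kept


def expansion_ratio(document: str, outputs: List[Reformulation]) -> float:
    """Whitespace-token count of all reformulations over the original's count."""
    original = len(document.split())
    generated = sum(len(r.text.split()) for r in outputs)
    return generated / original


# Toy stand-ins so the sketch runs without a language model.
def toy_pairs(document: str, n: int) -> List[Pair]:
    genres = ["FAQ", "lecture notes", "news brief", "dialogue", "tutorial"]
    audiences = ["students", "engineers", "general readers", "researchers", "hobbyists"]
    return list(zip(genres, audiences))[:n]


def toy_rewrite(document: str, genre: str, audience: str) -> str:
    return f"[{genre} for {audience}] {document}"


def toy_consistent(document: str, text: str) -> bool:
    return document in text


doc = "Repeating data too often degrades pretraining."
outputs = mga_reformulate(doc, toy_pairs, toy_rewrite, toy_consistent)
print(len(outputs), expansion_ratio(doc, outputs))
```

In practice the three callables would wrap model calls and a learned or rule-based filter. The ratio printed by the toy run only shows how token expansion could be measured and says nothing about the reported 3.9× figure.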

Abstract

Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we propose the Massive Genre-Audience (MGA) reformulation method, a lightweight and scalable data augmentation technique inspired by synthetic data methodologies. MGA systematically reformulates existing corpora into diverse, contextually rich variations to mitigate the negative effects of repetition, and we introduce this approach along with the resulting 770-billion-token MGACorpus in this work. We experimentally validate its core benefit by demonstrating superior performance against data repetition and upsampling in scaling scenarios (up to 13B parameters). Furthermore, a comprehensive analysis investigates the role of prompt engineering in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.

Paper Structure

This paper contains 31 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of MGA framework. Our method expands the original corpus through a two-stage synthesis process. Each document is reformulated to 5 new documents, achieving 3.9× token number expansion while maintaining diversity through massive (genre, audience) pairs.
  • Figure 2: t-SNE visualization results. Base (left) maintains a distribution that overlaps with but extends beyond the original data. Strict (middle) clusters also extend original data but indicate limited diversity compared to the Base variant. Relaxed (right) shows significant distributional shift, explaining its poor performance.
  • Figure 3: Training dynamics of two common scenarios under data-constrained conditions: (1) expanding a 50B high-quality dataset to a 500B training budget (entire set repetition), (2) expanding a 500B mixed-quality dataset to a 700B training budget (subset repetition). For data recipe details, see \ref{sec:appd_training}; benchmark details are provided in Appendix \ref{sec:appd_scaling_details}.
  • Figure 4: Benchmark results and validation losses. The sensitivity to data repetition varies across capability domains, with knowledge dimension showing greater resilience.
  • Figure 5: Validation losses of the experiments in Section \ref{sec:main_exp}.
  • ...and 2 more figures