Reformulation for Pretraining Data Augmentation
Xintong Hao, Ruijie Zhu, Ge Zhang, Ke Shen, Chenggang Li
TL;DR
The paper tackles the data repetition bottleneck in scaling large language models by introducing Massive Genre-Audience (MGA) reformulation, a lightweight data augmentation framework that reformulates existing text into diverse, context-rich variations. MGA generates the MGACorpus, a 770B-token dataset produced via a two-stage reformulation process guided by genre-audience pairs and quality controls based on Limited Consistency, enabling up to 3.9× token expansion with diverse presentations. Empirical results show MGA outperforms data repetition and upsampling across model sizes up to 13B parameters, with notable improvements on reasoning tasks; analyses reveal that prompt engineering and the diversity of reformulations influence performance, while standard validation losses may not fully reflect generalization or learning strategies. The work demonstrates MGA as a practical, scalable pathway to augment pretraining data, alleviating repetition bottlenecks and facilitating more efficient scaling of large language models, with implications for data-efficient training and synthetic data quality assessment.
Abstract
Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we propose the Massive Genre-Audience (MGA) reformulation method, a lightweight and scalable data augmentation technique inspired by synthetic data methodologies. MGA systematically reformulates existing corpora into diverse, contextually rich variations to mitigate the negative effects of repetition, and we introduce this approach along with the resulting 770-billion-token MGACorpus in this work. We experimentally validate its core benefit by demonstrating superior performance against data repetition and upsampling in scaling scenarios (up to 13B parameters). Furthermore, comprehensive analysis investigates the role of prompt engineering in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.
