Thinking Augmented Pre-training
Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
TL;DR
This work tackles data efficiency in large language model (LLM) pre-training, where high-quality data is in limited supply. It introduces Thinking Augmented Pre-Training (TPT), which augments text with automatically generated thinking trajectories produced by open-source LLMs. Each augmented sample is the concatenation $x=[d;t]$ of an original document $d$ and its thinking trajectory $t$, and training uses the standard next-token objective. Across experiments up to $100$B tokens and model sizes ranging from $1.5$B to $7$B parameters, TPT delivers substantial improvements in data efficiency (about a factor of $3$) and improves performance on reasoning benchmarks, with particularly large gains on math and coding tasks for reasoning-intensive data. The results establish TPT as a scalable data-engineering technique that enhances learning efficiency across pre-training, mid-training, and supervised fine-tuning, and point to future directions in scaling data, automatic prompt optimization, and stronger thinking-generation models.
Abstract
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking Augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
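The augmentation step can be illustrated with a minimal sketch: a document is concatenated with a generated thinking trajectory, and the combined token sequence is turned into shifted input/target pairs for next-token prediction. Everything named here is an assumption made for illustration and not the authors' implementation: `stub_thinker` stands in for a call to an open-source LLM, `toy_tokenize` is a whitespace tokenizer, and the separator and the choice to compute targets over the whole concatenated sequence are guesses.

```python
from typing import Callable, Dict, List, Tuple


def augment(document: str,
            generate_thinking: Callable[[str], str],
            sep: str = "\n\n") -> str:
    """Build an augmented sample: the document followed by its thinking trajectory."""
    thinking = generate_thinking(document)
    return document + sep + thinking


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer; a real setup would use the model's tokenizer."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def next_token_pairs(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position so that input i is trained to predict token i + 1."""
    return token_ids[:-1], token_ids[1:]


def stub_thinker(document: str) -> str:
    """Hypothetical stand-in for an LLM that writes a thinking trajectory."""
    return "Let me think step by step about: " + document


vocab: Dict[str, int] = {}
sample = augment("2 + 2 = 4", stub_thinker)
ids = toy_tokenize(sample, vocab)
inputs, targets = next_token_pairs(ids)
```

In this sketch the augmented sample is treated like any other training document, so no change to the training loop or loss is needed beyond feeding longer sequences.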
