Thinking Augmented Pre-training

Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

TL;DR

This work tackles data efficiency in large language model (LLM) pre-training, where high-quality data is in limited supply. It introduces Thinking Augmented Pre-Training (TPT), which augments text with thinking trajectories generated automatically by open-source LLMs: each document $d$ is concatenated with its trajectory $t$ to form an augmented sample $x = [d; t]$, and the model is trained on $x$ with the standard next-token objective. Across experiments up to $100$B tokens and model sizes ranging from $1.5$B to $7$B parameters, TPT improves data efficiency by about a factor of $3$ and raises scores on reasoning benchmarks, with particularly large gains on math and coding tasks for reasoning-intensive data. The results establish TPT as a scalable data-engineering technique that enhances learning efficiency across pre-training, mid-training, and supervised fine-tuning, and point to future directions in scaling data, automatic prompt optimization, and stronger thinking-generation models.

Abstract

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
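The augmentation step can be illustrated with a minimal sketch. It assumes that an augmented sample is the plain concatenation $x = [d; t]$ of document tokens and thinking tokens, and that the next-token loss is averaged over every position of $x$. The helper names (`build_augmented_sample`, `next_token_nll`, `uniform_log_prob`) and the toy uniform model are hypothetical and stand in for a real tokenizer and LLM; this is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical type: maps (prefix tokens, target token) to a log-probability.
LogProbFn = Callable[[Sequence[int], int], float]


def build_augmented_sample(doc: Sequence[int], thinking: Sequence[int]) -> List[int]:
    """Concatenate document tokens d and thinking tokens t into x = [d; t]."""
    return list(doc) + list(thinking)


def next_token_nll(tokens: Sequence[int], log_prob: LogProbFn) -> float:
    """Mean negative log-likelihood of each token given its prefix.

    Assumption: the loss covers all positions of the augmented sample,
    i.e. both the document part and the thinking part.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to predict a next token")
    total = 0.0
    for i in range(1, len(tokens)):
        total -= log_prob(tokens[:i], tokens[i])
    return total / (len(tokens) - 1)


def uniform_log_prob(vocab_size: int) -> LogProbFn:
    """Toy stand-in for an LLM: every token is equally likely."""
    def log_prob(prefix: Sequence[int], target: int) -> float:
        return -math.log(vocab_size)
    return log_prob


if __name__ == "__main__":
    d = [5, 7, 2]        # token ids of a raw document (made up)
    t = [9, 1, 4, 4]     # token ids of its thinking trajectory (made up)
    x = build_augmented_sample(d, t)
    print(x)
    print(next_token_nll(x, uniform_log_prob(vocab_size=10)))
```

Because the objective is unchanged, a sketch like this suggests why the method slots into an existing training pipeline: only the data preparation differs, while the loss computation is the same one used for raw documents.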

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 1: (a) The average few-shot accuracy scores on the GSM8k and MATH datasets with respect to total training tokens. Both models are pre-trained from scratch with $8$B parameters. One model employed a vanilla next-token prediction objective, while the other utilized thinking-augmented pre-training. (b) Illustration of a thinking augmented data sample. The token in red, "890", is both correct and valuable, yet it is difficult to learn directly. The complete text is provided in Appendix Table \ref{tab:app_example_1}.
  • Figure 2: Pre-training loss curves and aggregated scores on $5$ tasks with respect to total training tokens ($8$B model). Both models are trained from scratch on $100$B tokens. The loss values are not directly comparable due to differences in data distributions, but we demonstrate how thinking augmentation reduces data noise and enhances learnability. The final scores of both models are detailed in Appendix Table \ref{tab:base_model_performance}.
  • Figure 3: Task scores with respect to total training tokens ($8$B model). The tokens in raw documents are constrained to $10$B via random sampling. The final scores are detailed in Appendix Table \ref{tab:app_base_model_results}.
  • Figure 4: The average number of thinking tokens, categorized by domain, target audience, and reasoning intensity. The figure lists only the top-$10$ domains that exhibit the longest thinking trajectories.
  • Figure 5: Task scores with respect to the mid-training token budget. The "$0$B" data point corresponds to direct SFT without thinking augmented mid-training.
  • ...and 1 more figure