Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Yejin Choi, Bryan Catanzaro
TL;DR
This work investigates when to inject reasoning data in the training pipeline of LLMs, challenging the view that reasoning is best added only during post-training. By framing data allocation as an optimization over reasoning data injected in pretraining ($\text{D}_{\text{res}}^{\text{PT}}$) and SFT ($\text{D}_{\text{res}}^{\text{SFT}}$), it demonstrates that front-loading reasoning data yields a durable +19% average gain, and that diversity in pretraining data builds foundational capabilities while quality in SFT data refines them (+15% average gain). The study further shows that high-quality pretraining data can unlock latent gains during SFT, whereas naive scaling of SFT data can harm performance, and that reinforcement learning amplifies the benefits of the optimal cross-phase strategy, especially on expert tasks (e.g., AIME). Together, these results provide a principled, phase-aware blueprint for data allocation that improves reasoning across domains and tasks beyond standard post-training approaches. The findings reframe the boundary between pretraining and reasoning, offering a scalable strategy to build more capable, generalizable LLMs with compute-efficient data use.
Abstract
The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is also increasingly incorporated during the mid-training stage (a practice that is relatively more proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg gain), while SFT is more sensitive to data quality (15% avg gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
