Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings

Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross

TL;DR

The paper tackles data scarcity in training reasoning-capable LLMs by introducing a two-stage framework: a warmup phase that distills general reasoning traces from a toy domain (Knights & Knaves) and a target-domain adaptation phase that applies RLVR on a small set of domain-specific examples. The warmup phase yields broad, cross-domain improvements and effectively acts as a meta-learning prior, while the subsequent RLVR adaptation achieves higher sample efficiency and stronger final performance than direct RLVR on base models across math, coding, and knowledge-intensive tasks. Across MATH, HumanEval+, and MMLU-Pro, the approach maintains or enhances cross-domain generalization, especially when compared to training solely on a single domain. These findings suggest warmup as a practical strategy for building robust, reasoning-capable LLMs in low-resource settings and open avenues for extending the approach to larger models and alternative toy environments.

Abstract

Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely, Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR trained on the same small dataset ($\leq 100$ examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
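
To make the toy domain concrete, the sketch below brute-forces a small Knights & Knaves puzzle, where knights always tell the truth and knaves always lie. It is only an illustration of why such puzzles have mechanically checkable answers; it is not the authors' data pipeline. The puzzle itself, the function names `solve_knights_and_knaves` and `exact_match_reward`, and the dictionary encoding of statements are all assumptions made for this example. In the paper's pipeline K&K is used for distillation in the warmup stage, while RLVR is applied to the target domain; the binary reward here is shown only to illustrate what "verifiable" can mean, since the paper's actual reward is not described in this section.

```python
from itertools import product


def solve_knights_and_knaves(names, statements):
    """Return every role assignment consistent with all statements.

    names: list of speaker names.
    statements: dict mapping a speaker to a function that takes an
        assignment (name -> True for knight, False for knave) and returns
        whether that speaker's claim is true under the assignment.
    """
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, roles))
        # A knight's claim must be true; a knave's claim must be false.
        if all(statements[n](assignment) == assignment[n] for n in statements):
            solutions.append(assignment)
    return solutions


def exact_match_reward(predicted, reference):
    """Hypothetical binary reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if predicted == reference else 0.0


# Hypothetical two-person puzzle (not taken from the paper):
#   A says: "B is a knave."
#   B says: "A and I are both knights."
names = ["A", "B"]
statements = {
    "A": lambda r: not r["B"],
    "B": lambda r: r["A"] and r["B"],
}

solutions = solve_knights_and_knaves(names, statements)
print(solutions)  # the only consistent assignment: A is a knight, B is a knave
```

Because the search is exhaustive over all 2^n role assignments, it is only practical for small puzzles, but it shows that a proposed answer can be checked without human judgment.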

Paper Structure

This paper contains 33 sections, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Our two-stage methodology, involving (1) a warmup phase and (2) a target-task adaptation phase, for training reasoning models in resource-constrained settings
  • Figure 2: Absolute percentage increase relative to the base model for MATH, HumanEval+, and MMLU-Pro subsets (Physics & History) for various models: RLVR on base model, warmup only, and RLVR on warmed-up model
  • Figure 3: Generalization results for base+RLVR, warmup-only, and warmup+RLVR models. Top labels indicate RLVR training domain; bottom labels indicate evaluation domain. Warmup uses no domain-specific data.
  • Figure 4: Relative change in completion length (vs. base model) after training on each dataset (y-axis) and evaluating on others (x-axis). Top: base model; Bottom: warmed-up model.
  • Figure 5: Results of Qwen2.5-3B K&K distillation. The loss curve is shown on top and performance on MATH500 on the bottom. A higher learning rate, $2 \times 10^{-5}$, risks overfitting to the K&K domain rather than learning generalizable reasoning behaviors.
  • ...and 4 more figures