MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Sheng Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra

TL;DR

This work reframes reasoning emergence in LLMs as a data-centric problem, showing that sub-billion models can attain strong reasoning with carefully curated open data and a principled training curriculum. The authors propose a three-stage pipeline (pretraining, mid-training, post-training) combined with hierarchical data curation and cross-capability self-influence to optimize token usage, achieving state-of-the-art reasoning performance for fully open-source sub-1B models. The MobileLLM-R1 family, including a 950M-parameter model, matches or surpasses larger proprietary models on several benchmarks while using only a fraction of their training tokens, and it demonstrates practical on-device reasoning capabilities with favorable latency and memory profiles. The work provides open training recipes, data sources, and checkpoints, offering a concrete, reproducible path toward efficient, on-device reasoning in small LMs with broad impact for NLP applications.

Abstract

The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3's proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.
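
The token figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated there (~2T curated tokens, 4.2T pre-training tokens, 36T tokens for Qwen3). The "average passes" value is an illustrative assumption: it treats the 4.2T tokens as drawn uniformly from the ~2T pool, whereas the actual mixing ratios may weight sources differently.

```python
# Token budgets quoted in the abstract, in trillions of tokens.
CURATED_TOKENS_T = 2.0    # "~2T tokens of high-quality data" (approximate)
PRETRAIN_TOKENS_T = 4.2   # tokens consumed during pre-training
QWEN3_TOKENS_T = 36.0     # Qwen3's reported pretraining corpus size


def token_fraction(used: float, reference: float) -> float:
    """Return the share of `reference` tokens represented by `used`."""
    return used / reference


# Share of Qwen3's pretraining budget used by MobileLLM-R1.
fraction_of_qwen3 = token_fraction(PRETRAIN_TOKENS_T, QWEN3_TOKENS_T)

# Assumption (not stated in the abstract): uniform resampling, so this is the
# average number of times each curated token would be seen.
approx_passes = PRETRAIN_TOKENS_T / CURATED_TOKENS_T

print(f"Fraction of Qwen3 budget: {fraction_of_qwen3:.1%}")
print(f"Approximate passes over curated pool: {approx_passes:.1f}")
```

The first value reproduces the 11.7% figure in the abstract (4.2 / 36 ≈ 0.1167).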

Paper Structure

This paper contains 24 sections, 7 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Pretrained model accuracy vs. training efficiency trade-off.
  • Figure 2: Performance comparison of base models across three tasks: GSM8k, HumanEval, and MMLU. Models are grouped by parameter size and color-coded by model family: MobileLLM-R1 (purple), SmolLM (orange), OLMo (yellow), and other partially open-source models (gray). Labels indicate model name and size for select models. MobileLLM-R1 consistently achieves strong performance across tasks while remaining parameter-efficient. A comprehensive comparison is presented in Table \ref{tab:base_acc}.
  • Figure 3: Performance comparison of post-trained models across three tasks: MATH, AIME'24, and LiveCodeBench-v6. The full comparison results are provided in Table \ref{tab:post_acc}.
  • Figure 4: Overall training pipeline of MobileLLM-R1.
  • Figure 5: Hierarchical rejection sampling. We employ the FineWeb-Edu classifier in conjunction with the ASK-LLM paradigm to construct a representative subset from each pretraining corpus. Samples are assigned selection scores, and inclusion is determined by thresholding these scores.
  • ...and 5 more figures
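
The Figure 5 caption describes assigning selection scores to samples and keeping those that pass a threshold. A minimal sketch of that idea follows. Everything specific in it is hypothetical: the `Sample` fields, the score ranges, the threshold values, and the choice to apply a cheap classifier filter before an LLM-based filter are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str
    edu_score: float  # hypothetical educational-quality classifier score in [0, 1]
    llm_score: float  # hypothetical ASK-LLM-style usefulness score in [0, 1]


def select(
    samples: List[Sample],
    edu_threshold: float = 0.5,  # illustrative value, not from the paper
    llm_threshold: float = 0.5,  # illustrative value, not from the paper
) -> List[Sample]:
    """Keep samples whose scores clear both thresholds.

    The two-stage order (classifier first, LLM score second) is an assumption
    made here so the cheaper filter runs before the more expensive one.
    """
    stage_one = [s for s in samples if s.edu_score >= edu_threshold]
    return [s for s in stage_one if s.llm_score >= llm_threshold]


corpus = [
    Sample("proof of the triangle inequality", edu_score=0.9, llm_score=0.8),
    Sample("celebrity gossip", edu_score=0.2, llm_score=0.9),
    Sample("tutorial with unclear steps", edu_score=0.7, llm_score=0.3),
]

kept = select(corpus)
for sample in kept:
    print(sample.text)
```

In this toy corpus only the first sample clears both thresholds; raising or lowering either threshold trades subset size against the quality signal each score is meant to capture.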