DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li

TL;DR

DeepDistill introduces a large-scale, difficulty-graded reasoning dataset for training LLMs on long-form reasoning. It combines cross-model distillation with a data selection strategy based on pass rate and the Coefficient of Variation (CV) to identify high-value training instances, and reports strong results on AIME2024 and related benchmarks. The work observes that reasoning-focused training of base models calls for higher learning rates, and it shows that supervised fine-tuning of base models on carefully selected data can rival open-source RL-based methods. The data and training recipes are publicly released. These results offer a practical path to building open-source LLMs with strong reasoning ability and encourage broader, transparent research in long-form reasoning.

Abstract

Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M
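
The abstract names pass rate and Coefficient of Variation (CV) as the selection signals but does not state how they are combined. The following is only a minimal sketch under stated assumptions, not the authors' procedure: each query has 0/1 correctness outcomes for several sampled responses per model, CV is taken as the population standard deviation divided by the mean of the per-model pass rates, and a query is kept when models disagree on it and it is not solved every time. The function names, the data layout, and both thresholds are hypothetical.

```python
from statistics import mean, pstdev


def pass_rate(outcomes):
    """Fraction of sampled responses judged correct (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else 0.0


def select_queries(per_model_outcomes, min_cv=0.05, max_pass_rate=1.0):
    """Keep queries whose per-model pass rates vary and that are not always solved.

    per_model_outcomes maps query id -> {model name -> list of 0/1 outcomes}.
    min_cv and max_pass_rate are illustrative thresholds, not values from the paper.
    """
    selected = []
    for query_id, by_model in per_model_outcomes.items():
        rates = [pass_rate(o) for o in by_model.values()]
        all_outcomes = [x for o in by_model.values() for x in o]
        overall = pass_rate(all_outcomes)
        if overall < max_pass_rate and coefficient_of_variation(rates) >= min_cv:
            selected.append(query_id)
    return selected


# Toy data with hypothetical model names "small" and "large".
example = {
    "q_easy": {"small": [1, 1, 1, 1], "large": [1, 1, 1, 1]},
    "q_mixed": {"small": [0, 0, 1, 0], "large": [1, 1, 1, 1]},
    "q_unsolved": {"small": [0, 0, 0, 0], "large": [0, 0, 0, 0]},
}
print(select_queries(example))
```

In this toy setup only the query on which the two models disagree survives: the always-solved query is excluded by the pass-rate cap, and the never-solved query has zero CV because every per-model pass rate is 0.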

Paper Structure

This paper contains 42 sections, 8 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Benchmark performance of open-source models on AIME2024.
  • Figure 2: Distribution of training data types in Supervised Fine-Tuning (SFT) Stage I. The left pie chart shows the proportion at the instance level, while the right pie chart shows the distribution at the answer-token level.
  • Figure 3: Distribution of training data types in Supervised Fine-Tuning (SFT) Stage II. The left pie chart shows the proportion at the instance level, while the right pie chart shows the distribution at the answer-token level.
  • Figure 4: Loss curves of 72B model training.
  • Figure 5: Variation of the 32B model's AIME2024 score, generation stop ratio, and average generated token length over training steps.
  • ...and 3 more figures