1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

Han Zhao, Haotian Wang, Yiping Peng, Sitong Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji, Xiangang Li

TL;DR

To address the challenge of scalable long-context reasoning, the paper introduces AM-DeepSeek-R1-Distilled, a 1.4M-entry dataset with reasoning traces, built from open-source problems and responses distilled from reasoning models (predominantly DeepSeek-R1). It describes a three-stage pipeline (Raw Data Collection, Distilling, and Rejection Sampling), combined with semantic deduplication and verification against reference answers, test cases, or a reward model, to ensure data quality. Experiments show that the models trained on this data with SFT, AM-Distill-Qwen-32B (from Qwen-2.5-32B) and AM-Distill-Qwen-72B (from Qwen-2.5-72B), outperform the corresponding DeepSeek-R1 distilled models on four benchmarks (AIME2024, MATH-500, GPQA-Diamond, LiveCodeBench). The dataset is released to the research community to foster development of reasoning-oriented LLMs and advance AGI-related capabilities.

Abstract

AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, then subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses in the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification: mathematical problems are validated against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, trained on this data with only simple Supervised Fine-Tuning (SFT), outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. The AM-Distill-Qwen-72B model likewise surpassed the DeepSeek-R1-Distill-Llama-70B model on all four benchmarks. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M.
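
The verification step described in the abstract can be pictured as a routing rule: check a reference answer if one exists, otherwise run test cases, otherwise fall back to a reward model. The following is a minimal sketch of that idea, not the authors' implementation. The `Entry` fields, the `last_line` answer extractor, the `run_code` and `reward_score` callables, and the `threshold` value are all hypothetical names introduced for illustration; the paper's actual answer matching, sandboxing, and reward model are not specified here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Entry:
    """One distilled sample (field names are illustrative assumptions)."""
    question: str
    response: str
    reference_answer: Optional[str] = None
    # Hypothetical (stdin, expected_stdout) pairs for code problems.
    test_cases: List[Tuple[str, str]] = field(default_factory=list)


def normalize(answer: str) -> str:
    """Crude normalization: lowercase and drop all whitespace."""
    return "".join(answer.lower().split())


def last_line(response: str) -> str:
    """Toy answer extractor: treat the last non-empty line as the final answer."""
    lines = response.strip().splitlines()
    return lines[-1] if lines else ""


def verify(
    entry: Entry,
    extract_answer: Callable[[str], str],
    run_code: Callable[[str, str], str],
    reward_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> bool:
    """Return True if the response passes the check that applies to this entry."""
    if entry.reference_answer is not None:
        # Math-style check: compare the extracted answer with the reference.
        return normalize(extract_answer(entry.response)) == normalize(entry.reference_answer)
    if entry.test_cases:
        # Code-style check: every test case must produce the expected output.
        return all(
            run_code(entry.response, stdin).strip() == expected.strip()
            for stdin, expected in entry.test_cases
        )
    # Everything else: accept only if the reward model scores it highly enough.
    return reward_score(entry.question, entry.response) >= threshold


# Usage with stub callables standing in for a real sandbox and reward model.
math_entry = Entry("What is 6 * 7?", "6 * 7 = 42\n42", reference_answer="42")
keep = verify(
    math_entry,
    extract_answer=last_line,
    run_code=lambda code, stdin: "",
    reward_score=lambda question, response: 0.0,
)
```

In a rejection-sampling loop, a rule like this would be applied to each sampled response, and only responses for which `verify` returns `True` would be kept.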

Paper Structure

This paper contains 28 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 2: Construction process of the data pipeline.
  • Figure 3: Token length distribution of data entries in the dataset. Most data entries contain fewer than 4096 tokens, with the highest concentration around 2048 tokens. The distribution gradually decreases as the token count increases, indicating fewer samples with longer contexts.
  • Figure 4: Distribution of reference answers and test cases in the dataset. Among the entries, 38.9% have reference answers, 21.9% include test cases, and 39.2% have neither reference answers nor test cases.
  • Figure 5: Distribution of data entries across different task categories. The dataset primarily consists of Math (29.3%), Coding (24.3%), and Information Seeking (22.2%) tasks, followed by Reasoning (10.4%), Planning (2.3%), Creative Writing (2.2%), and other combined categories (9.3%).
  • Figure 6: Difficulty distribution of the data entries. Most of the dataset entries are classified as Medium (51.8%) or Hard (25.7%). A smaller proportion falls into the Easy (11.2%), Very Hard (6.5%), and Very Easy (4.7%) categories.
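
To get a feel for the scale behind these percentages, the shares in the captions can be converted into rough entry counts. The sketch below assumes a nominal size of exactly 1,400,000 entries (the "1.4 million" figure, rounded), so the resulting counts are estimates rather than reported numbers. It also checks how closely each set of published percentages sums to 100.

```python
# Nominal dataset size; an assumption based on the rounded "1.4 million" figure.
NOMINAL_SIZE = 1_400_000

# Percentages as stated in the captions of Figures 4 and 6.
verification = {"reference answer": 38.9, "test cases": 21.9, "neither": 39.2}
difficulty = {"Medium": 51.8, "Hard": 25.7, "Easy": 11.2, "Very Hard": 6.5, "Very Easy": 4.7}


def approx_counts(shares, total=NOMINAL_SIZE):
    """Convert percentage shares into estimated entry counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


def share_total(shares):
    """Sum of the percentages, rounded to one decimal as in the captions."""
    return round(sum(shares.values()), 1)


verification_counts = approx_counts(verification)
difficulty_counts = approx_counts(difficulty)

for name, count in verification_counts.items():
    print(f"{name}: ~{count:,} entries")
print("verification shares sum to", share_total(verification))
print("difficulty shares sum to", share_total(difficulty))
```

Under this assumption, roughly 60% of the entries (those with a reference answer or test cases) can be checked against ground truth, and the rest rely on the reward model. The difficulty shares sum to 99.9 rather than 100, which is consistent with rounding in the caption.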