Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li

TL;DR

The paper analyzes data distillation for reasoning in open-source LLMs by building a large, parallel corpus of 1.89 million queries distilled from three teacher models (AM-Thinking-v1, Qwen3-235B-A22B, DeepSeek-R1). It shows that AM-Thinking-v1 distillates yield the strongest performance across math and coding benchmarks, with adaptive generation lengths that scale with task difficulty. Through preprocessing, verification, and quality assurance, the study links data distribution characteristics, such as token-length diversity and perplexity, to downstream model behavior and accuracy. The authors release the distilled datasets to foster progress in reasoning-focused open models and discuss future RL-based enhancements to improve reasoning capability and alignment.

Abstract

Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The model distilled from AM-Thinking-v1 consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
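
The two distribution statistics named in the abstract, token length and perplexity, can be illustrated with a minimal sketch. It assumes perplexity is computed in the usual way, as the exponential of the negative mean token log-probability of a response. This section does not say which scoring model or tokenizer the authors used, so the function names and input values below are hypothetical and for illustration only.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of one response: exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def length_summary(token_counts):
    """Mean and population standard deviation of response lengths, in tokens."""
    return {"mean": mean(token_counts), "std": pstdev(token_counts)}


# Hypothetical inputs (not taken from the paper): four tokens, each assigned
# probability 0.5 by some scoring model, and two response lengths in tokens.
example_logprobs = [math.log(0.5)] * 4
example_ppl = perplexity(example_logprobs)
example_lengths = length_summary([100, 300])
```

A lower perplexity means the scoring model finds the response more predictable; a larger standard deviation of token counts is one simple way to read "token length diversity", though the paper may measure it differently.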

Paper Structure

This paper contains 15 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Open-source model benchmarks on AIME2024/LiveCodeBench.
  • Figure 2: Instance-level and token-level output distributions for AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1. The general chat category includes both multi-turn conversations and other types of data.
  • Figure 3: Token span distribution of instances for AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1 on math.
  • Figure 4: Token count distributions for AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1 datasets. Box plots show the distribution of token numbers, with means labeled. Qwen3-235B-A22B has the highest average token count, followed by DeepSeek-R1 and AM-Thinking-v1.
  • Figure 5: Perplexity (PPL) distributions for AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1 datasets. Box plots show PPL distributions, with means labeled. AM-Thinking-v1 achieves the lowest mean PPL, indicating better overall quality.
  • ...and 1 more figure