Not All Correct Answers Are Equal: Why Your Distillation Source Matters

Xiaoyu Tian; Yunjie Ji; Haotian Wang; Shuaiting Chen; Sitong Zhao; Yiping Peng; Han Zhao; Xiangang Li

TL;DR

The paper studies reasoning data distillation for open-source LLMs by building three parallel datasets over a shared corpus of 1.89 million queries, each distilled from a different teacher model (AM-Thinking-v1, Qwen3-235B-A22B, DeepSeek-R1). Students trained on the AM-Thinking-v1 data score highest on the math and coding benchmarks and adapt their generation length to task difficulty, writing longer responses for harder problems and shorter ones for simpler problems. After preprocessing, verification, and quality assurance, the study relates properties of the data distribution, such as token-length diversity and perplexity, to downstream model behavior and accuracy. The authors release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support work on open reasoning models, and discuss RL-based methods as a future route to better reasoning capability and alignment.

Abstract

Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The model distilled from AM-Thinking-v1 consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
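The two distribution properties named in the abstract, token-length diversity and perplexity, can be illustrated with a minimal sketch. This is not the authors' analysis code: the helper names `perplexity` and `length_stats` are hypothetical, whitespace splitting stands in for a real tokenizer, and the per-token log-probabilities are assumed to come from some scoring model that the abstract does not specify.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of one sequence from natural-log token probabilities.

    Uses the standard definition exp(-mean(log p)). Where the log-probs
    come from (which scoring model) is an assumption left to the caller.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def length_stats(responses, tokenize=str.split):
    """Summarize the token-length distribution of a list of responses.

    `tokenize` defaults to whitespace splitting as a stand-in; a real
    analysis would pass the student model's tokenizer instead.
    """
    lengths = [len(tokenize(r)) for r in responses]
    return {
        "mean": mean(lengths),
        "std": pstdev(lengths),  # spread of lengths as a simple diversity proxy
        "min": min(lengths),
        "max": max(lengths),
    }


if __name__ == "__main__":
    # Toy inputs for illustration only; not data from the paper.
    print(perplexity([math.log(0.5)] * 4))
    print(length_stats(["a short answer", "a much longer chain of reasoning steps"]))
```

Under this sketch, a dataset with a larger length standard deviation would count as more length-diverse, and a lower average per-sequence perplexity would mean the scoring model finds the text more predictable.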
