Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang

TL;DR

The paper tackles the need for compact yet capable reasoning models in industry by extending the DistilQwen family with four model series: slow-thinking for high accuracy, two adaptive-thinking families for dynamic reasoning, and distilled reward models for RL with distilled knowledge. It presents an end-to-end pipeline including data collection, CoT data generation and refinement, and curriculum-based training, plus RL integration through GRPO using RV/CD rewards. Evaluations across AIME2024, MATH500, GPQA Diamond, and LiveCodeBench V2 show that adaptive-thinking models deliver strong reasoning performance with favorable efficiency, while reward models improve RL outcomes over baselines. The work demonstrates practical impact through open-source releases and cloud-platform integration on Alibaba Cloud PAI, highlighting how KD-based reasoning can scale to real-world industrial applications.

Abstract

Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Roadmap for training DistilQwen reasoning and reward models.
  • Figure 2: High-level process for obtaining DistilQwen reasoning and reward models.
  • Figure 3: Performance of DistilQwen2.5-R1 models in terms of Pass@$K$ under multiple inference attempts.
  • Figure 4: Snapshots of the integration of DistilQwen reasoning models with the AI platform.