DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang

TL;DR

The paper tackles the resource constraints of deploying large language models by distilling them into smaller, open-source variants. It introduces DistilQwen2.5, a family of models trained with a two-stage pipeline that combines black-box, multi-agent data augmentation with CoT-guided rewriting and an efficient white-box model fusion that transfers knowledge from large teachers to smaller students. Evaluations on AlpacaEval 2.0, MT-Bench, and IFEval show significant instruction-following gains, with the largest improvements for compact backbones. The paper also describes practical deployments such as SQL completion and cloud knowledge distillation (KD) workflows. Overall, the work provides industrially viable strategies for constructing a spectrum of compact LLMs that achieve strong task performance while reducing inference costs, and it releases the DistilQwen2.5 family as open source.

Abstract

Enhancing computational efficiency and reducing deployment costs for large language models (LLMs) have become critical challenges in various resource-constrained scenarios. In this work, we present DistilQwen2.5, a family of distilled, lightweight LLMs derived from the public Qwen2.5 models. These distilled models exhibit enhanced instruction-following capabilities compared to the original models based on a series of distillation techniques that incorporate knowledge from much larger LLMs. In our industrial practice, we first leverage powerful proprietary LLMs with varying capacities as multi-agent teachers to select, rewrite, and refine instruction-response pairs that are more suitable for student LLMs to learn. After standard fine-tuning, we further leverage a computationally efficient model fusion approach that enables student models to progressively integrate fine-grained hidden knowledge from their teachers. Experimental evaluations demonstrate that the distilled models possess significantly stronger capabilities than their original checkpoints. Additionally, we present use cases to illustrate the applications of our framework in real-world scenarios. To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Brief comparison between original Qwen2.5 and DistilQwen2.5 models in terms of AlpacaEval 2.0 (length-controlled) and IFEval scores.
  • Figure 2: Functionalities for LLMs/agents used in data augmentation and black-box distillation. Disclaimer: We use the Qwen logo in the figure; however, any LLMs with sufficient capabilities can be used as well.
  • Figure 3: Comparison of the inference speed for logits generation between our approach and the vanilla approach (average seconds per sample).
  • Figure 4: Comparison between various small models (<10B) based on AlpacaEval 2.0 (length-controlled).
  • Figure 5: Comparison between black-box KD and white-box KD with varying teacher model sizes after black-box KD, in terms of AlpacaEval 2.0 (length-controlled) and MT-Bench scores (both full and single).
  • ...and 2 more figures