Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation

Miao Rang; Zhenni Bi; Hang Zhou; Hanting Chen; An Xiao; Tianyu Guo; Kai Han; Xinghao Chen; Yunhe Wang

TL;DR

This work addresses the deployment of capable language models on edge devices by proposing a post-training pipeline for 1B-scale models that combines curriculum-based supervised fine-tuning (SFT) with offline on-policy knowledge distillation (KD). The two-stage SFT curriculum first teaches robust reasoning and then fast responses, while offline KD aligns the student with teacher guidance without iterative online loops, yielding gains on mathematics, code generation, and multilingual benchmarks. The resulting model, openPangu Embedded-1B-KD, achieves state-of-the-art performance among billion-parameter models and is shown to be practical to deploy on Ascend edge hardware, which suggests that targeted post-training can substitute for some of the gains usually sought through model scaling. Overall, the approach offers an efficient path to high-performing language models suited to edge deployment.

Abstract

The rapid progress of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across many domains. However, their massive scale and high computational cost make them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can run efficiently at the edge. Yet after pre-training alone, such small models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently improves small-model accuracy. The pipeline consists of curriculum-based supervised fine-tuning (SFT) followed by offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
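
As a rough illustration of the distillation stage, the sketch below computes a token-level forward KL divergence between teacher and student next-token distributions over one sequence. The abstract does not state the exact objective, so the loss form, the temperature parameter, and the function names (`log_softmax`, `token_kl`, `sequence_kd_loss`) are assumptions made for illustration only. In an offline on-policy setting, one would expect the sequence to be sampled from the student and the teacher logits to be computed for it in a separate batch pass, but that scheduling is also assumed here rather than taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of raw logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def token_kl(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position (assumed objective)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_kd_loss(teacher_seq, student_seq, temperature=1.0):
    """Mean per-token KL over a sequence.

    teacher_seq and student_seq are lists of per-position logit vectors
    for the same (hypothetically student-sampled) token sequence.
    """
    if not teacher_seq or len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [
        token_kl(t, s, temperature) for t, s in zip(teacher_seq, student_seq)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge; a training loop would minimize it with respect to the student's parameters.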
