Smaller but Better: Self-Paced Knowledge Distillation for Lightweight yet Effective LCMs
Yujia Chen, Yang Ye, Zhongqi Li, Yuchi Ma, Cuiyun Gao
TL;DR
SODA introduces a self-paced knowledge distillation framework that adaptively transfers programming capabilities from large, advanced large code models (LCMs) to lightweight counterparts. By combining correctness-focused supervision with fault-aware contrastive learning, and by employing multi-view feedback (model-based scoring and static tool-based measurement) to guide adaptive seed knowledge updates, SODA achieves substantial improvements across Python, Java, JavaScript, C, C++, Go, and TypeScript benchmarks. The SodaCoder family, built on CodeLlama-7B and DeepseekCoder-6.7B, outperforms 15 LCMs with at most 16B parameters and even approaches or surpasses ChatGPT on average Pass@1, while delivering favorable training/inference efficiency. The approach demonstrates strong generalization and is accompanied by validation of the scoring and detailed cost analyses, highlighting its practical impact for deploying lightweight yet capable LCMs in real-world code generation tasks.
Abstract
Large code models (LCMs) have remarkably advanced the field of code generation. Despite their impressive capabilities, they still face practical deployment issues, such as high inference costs, limited accessibility of proprietary LCMs, and limited adaptability of ultra-large LCMs. These issues highlight the critical need for more accessible, lightweight yet effective LCMs. Knowledge distillation (KD) offers a promising solution by transferring the programming capabilities of larger, advanced LCMs to smaller, less powerful ones. In this paper, we propose a novel Self-Paced knOwledge DistillAtion framework, named SODA, which aims to develop lightweight yet effective student LCMs. SODA consists of three stages per cycle: (1) The Correct-and-Fault Knowledge Delivery stage improves the student model's ability to recognize errors while preserving its basic programming skills during knowledge transfer, using correctness-aware supervised learning and fault-aware contrastive learning. (2) The Multi-View Feedback stage measures the quality of the student model's outputs from two views, model-based and static tool-based measurement, to identify difficult questions. (3) The Feedback-based Knowledge Update stage adaptively updates the student model by generating new questions at different difficulty levels, where the levels are assigned based on the feedback from the second stage. Experimental results show that SODA improves the student model by 65.96% in terms of average Pass@1, outperforming the best baseline by 29.85%. Based on the SODA framework, we develop SodaCoder, a series of lightweight yet effective LCMs, which outperform 15 LCMs with at most 16B parameters. Notably, SodaCoder-DS-6.7B, built on DeepseekCoder-6.7B, even surpasses the prominent ChatGPT on average Pass@1.
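To make the link between the second and third stages concrete, the following is a minimal sketch of how two feedback views might be combined to sort questions into difficulty levels. It is an illustration only: the `Feedback` record, the score ranges, the averaging rule, the thresholds, and the three level names are all assumptions introduced here, not the scoring or categorization rule used by SODA.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """Hypothetical per-question feedback record (not from the paper)."""
    question: str
    model_score: float  # assumed: model-based quality score in [0, 1]
    tool_score: float   # assumed: static tool-based score in [0, 1]


def difficulty(fb, easy_at=0.8, hard_below=0.4):
    """Map the two views to a difficulty label.

    Averaging the views and the two thresholds are illustrative
    assumptions, chosen only to show the shape of the step.
    """
    combined = (fb.model_score + fb.tool_score) / 2
    if combined >= easy_at:
        return "easy"      # student already answers this well
    if combined < hard_below:
        return "hard"      # student output scored poorly in both views
    return "medium"


def bucket_by_difficulty(feedback):
    """Group question texts by difficulty label for the update stage."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for fb in feedback:
        buckets[difficulty(fb)].append(fb.question)
    return buckets
```

In a loop of this shape, the resulting buckets would decide which seed questions are used to generate new questions for the next cycle, so that later cycles concentrate on what the student still gets wrong.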
