LLM-NEO: Parameter Efficient Knowledge Distillation for Large Language Models
Runming Yang, Taiqiang Wu, Jiahao Wang, Pengfei Hu, Yik-Chung Wu, Ngai Wong, Yujiu Yang
TL;DR
LLM-NEO is a parameter-efficient knowledge distillation framework that unifies KD and LoRA under a single update paradigm, transferring knowledge from teacher to student through a low-rank branch. By modeling updates as $W_t = W_0 + \Delta W_t$ and combining KD's multi-source guidance with LoRA's low-rank constraint, LLM-NEO achieves competitive accuracy while reducing memory and compute. Empirical results on the Llama 2 and Llama 3 series show that LLM-NEO outperforms KD, LoRA, and SFT, with up to 25% savings in GPU memory and training time, and that it is robust to LoRA variants such as MoSLoRA. The approach scales with data and remains compatible with memory-optimization techniques such as ZeRO, which makes it practical for deploying compact, capable LLMs.
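The update rule $W_t = W_0 + \Delta W_t$ can be illustrated with a minimal, dependency-free sketch. It assumes the usual LoRA factorization $\Delta W_t = BA$ with a zero-initialized $B$, and a loss that mixes cross-entropy on the ground-truth label with a forward KL term toward the teacher. The function names, the mixing weight `alpha`, the KL direction, and the toy shapes are all illustrative assumptions, not the authors' implementation.

```python
import math

def matmul(X, Y):
    # Plain-Python matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(W, x):
    # Matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    # Numerically stable softmax.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def effective_weight(W0, B, A):
    # W_t = W_0 + Delta W_t, where Delta W_t = B @ A has rank at most r.
    delta = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]

def distill_loss(student_logits, teacher_logits, label, alpha=0.5):
    # Hypothetical mix: (1 - alpha) * CE(label) + alpha * KL(teacher || student).
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -math.log(p_s[label])
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return (1 - alpha) * ce + alpha * kl

# Toy setup: frozen 2x3 weight, rank-1 branch (only A and B would be trained).
W0 = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1]]
A = [[0.1, 0.2, 0.3]]   # r x d_in
B = [[0.0], [0.0]]      # d_out x r, zero init so Delta W = 0 at the start
x = [1.0, 2.0, 3.0]

W_t = effective_weight(W0, B, A)
student_logits = matvec(W_t, x)
teacher_logits = [0.5, 1.5]   # stand-in for a teacher forward pass
loss = distill_loss(student_logits, teacher_logits, label=1)
```

In this sketch the teacher signal and the label both enter through the loss, while the low-rank constraint enters only through how `W_t` is parameterized, which is the sense in which the two methods share one update form.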
Abstract
Knowledge distillation (KD) has been a predominant method for compressing Large Language Models (LLMs). In this paper, we first revisit KD and Low-Rank Adaptation (LoRA) and demonstrate that they follow the same paradigm. Inspired by this observation, we propose a parameter-efficient knowledge distillation method, LLM-NEO, which integrates LoRA into KD to improve the efficiency of knowledge transfer. We then summarize practical guidelines for choosing the hyperparameters in LLM-NEO. Experimental results on compressing Llama 2 and Llama 3.2 show that LLM-NEO outperforms various baselines. Further analysis demonstrates the robustness of LLM-NEO to variants of LoRA. The code and trained models are available on [GitHub](https://github.com/yang3121099/LLM-Neo).
