LLM-NEO: Parameter Efficient Knowledge Distillation for Large Language Models

Runming Yang, Taiqiang Wu, Jiahao Wang, Pengfei Hu, Yik-Chung Wu, Ngai Wong, Yujiu Yang

TL;DR

LLM-NEO introduces a parameter-efficient knowledge distillation framework that unifies KD and LoRA under a single update paradigm, enabling efficient transfer from teacher to student via a low-rank branch. By modeling updates as $W_t = W_0 + \Delta W_t$ and combining KD's multi-source guidance with LoRA's low-rank constraints, LLM-NEO achieves competitive accuracy while reducing memory and compute. Empirical results on the Llama 2 and Llama 3 series show that LLM-NEO outperforms KD, LoRA, and supervised fine-tuning (SFT), with up to 25% savings in GPU memory and training time, and that it remains robust across LoRA variants such as MoSLoRA. The approach scales with data and remains compatible with memory-optimization techniques such as ZeRO, offering practical benefits for deploying compact, capable LLMs.
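As a rough illustration of the idea summarized above, the following sketch combines a LoRA-style low-rank update, $W = W_0 + BA$, with a distillation objective that mixes a supervised term and a teacher-matching term. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names, the weighting `alpha`, and the exact form of the loss are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, b, a):
    """Effective weight W = W0 + B @ A; W0 stays frozen, only B and A are trained."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


def trainable_params(d_out, d_in, rank):
    """Trainable parameters in the low-rank branch: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


def distill_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Assumed objective: (1 - alpha) * cross-entropy + alpha * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    supervised = -math.log(p_student[label])
    teacher_match = kl_divergence(p_teacher, p_student)
    return (1 - alpha) * supervised + alpha * teacher_match


if __name__ == "__main__":
    # Rank-1 update on a 2 x 3 weight matrix.
    w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    b = [[1.0], [2.0]]
    a = [[0.5, 0.0, -1.0]]
    print(lora_weight(w0, b, a))

    # Low-rank branch versus a full update for a 4096 x 4096 layer at rank 8.
    print(trainable_params(4096, 4096, 8), 4096 * 4096)

    # The loss for one token position with a three-way vocabulary.
    print(distill_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1], label=0))
```

For a 4096 x 4096 layer at rank 8, the low-rank branch trains 65,536 parameters instead of 16,777,216. This illustrates why restricting the student's update to a low-rank branch can reduce memory during distillation.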

Abstract

Knowledge distillation (KD) has been a predominant method for compressing Large Language Models (LLMs). In this paper, we first revisit KD and Low-Rank Adaptation (LoRA) and demonstrate that they follow the same paradigm. Inspired by this observation, we propose a parameter-efficient knowledge distillation method, LLM-NEO, which integrates LoRA into KD to improve the efficiency of knowledge transfer. We then summarize practical guidelines for the hyperparameters in LLM-NEO. Experimental results on compressing Llama 2 and Llama 3.2 show that LLM-NEO outperforms various baselines. Further analysis demonstrates the robustness of LLM-NEO on variants of LoRA. The code and trained models are available on [GitHub](https://github.com/yang3121099/LLM-Neo).

Paper Structure

This paper contains 22 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of different knowledge transfer pipelines (KD, LoRA, and LLM-NEO). The proposed LLM-NEO pipeline combines the benefits of both the KD and LoRA approaches, i.e., knowledge distilled from the teacher and the efficiency of a low-rank branch.
  • Figure 2: Grid search results for distilling Llama 2 into TinyLlama, with rank varied from 2 to 256. The score matrix shows the average of 10 reasoning metrics, with darker colors indicating better performance.
  • Figure 3: Normalized performance on 10 benchmarks for the Llama 3.2-1B Instruct model before and after distillation via LLM-NEO.
  • Figure 4: Comparison of vanilla LLM-NEO and LLM-NEO-MoSLoRA on MMLU and PIQA.
  • Figure 5: Performance as the number of tokens increases from $10^6$ to $10^8$.
  • ...and 1 more figure