Knowledge Distillation for Large Language Models

Alejandro Paredes La Torre; Barbara Flores; Diego Rodriguez

Abstract

We propose a resource-efficient framework for compressing large language models through knowledge distillation combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation to English Dolly-15k, Spanish Dolly-15k, and the BugNet and PyTorrent code datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% ROUGE-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization (GRPO) on CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
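As a rough illustration of the teacher-student objective the abstract refers to, the sketch below computes a joint distillation loss for a single token position. It assumes the classic formulation (a weighted sum of hard-label cross-entropy and a temperature-scaled KL term between teacher and student distributions); the exact loss used in the paper is not given here, and the names `joint_distillation_loss`, `alpha`, and `temperature` are illustrative choices, not the authors' API.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def joint_distillation_loss(student_logits, teacher_logits, target_index,
                            alpha=0.5, temperature=2.0):
    """Assumed joint loss: alpha * CE + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student against the reference token.
    hard = -math.log(softmax(student_logits)[target_index])
    # Soft-label term: match the teacher's temperature-softened distribution.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any implementation of this kind.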

Paper Structure (17 sections, 22 equations, 5 figures, 1 table)

Introduction
Problem
Objectives
Contributions
Related Work
Context
Methodology
Knowledge Distillation
Chain of Thought
Post-Training Quantization
Experiments
Knowledge Distillation English
Knowledge Distillation Spanish
Knowledge distillation for code correction and code generation
Reinforcement Learning for Chain of Thought
...and 2 more sections

Figures (5)

Figure 1: Distillation process with a bigger SFT model (Teacher) and a smaller model (Student) that adjusts weights on the joint distillation loss. The student model uses the sequence generated by the teacher to learn the distribution of the task.
Figure 2: Performance of the Qwen 0.5B model on the English Dolly dataset with varying learning rates.
Figure 3: Performance of the Qwen 0.5B model on the Spanish Dolly Multilingual dataset.
Figure 4: Performance of the Qwen2 0.5B model vs. an SFT baseline on the BugNet dataset, the Qwen2 0.5B model vs. an SFT baseline on the PyTorrent dataset, and the Qwen2.5 model vs. an SFT baseline on the PyTorrent dataset.
Figure 5: Training sequence of reinforcement learning with chain of thought. The goal is for the model to shift its original output distribution, short generated sequences, toward longer sequences of text with specific keywords and distribution. The left panel shows the KL divergence, which increases as the model shifts away from short generations; the plot also shows the increase in the reward functions.
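
The reward signal mentioned in the Figure 5 caption is consumed by Group Relative Policy Optimization, which scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that normalization step, assuming the commonly described form (reward minus group mean, divided by group standard deviation); the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled group into advantages."""
    mean_reward = statistics.fmean(rewards)
    # Population standard deviation over the group (an assumption of this sketch).
    std_reward = statistics.pstdev(rewards)
    # eps keeps the division finite when every completion earns the same reward.
    return [(r - mean_reward) / (std_reward + eps) for r in rewards]
```

Completions that score above the group average receive positive advantages and are reinforced, while a group with identical rewards yields all-zero advantages and therefore no policy update from that prompt.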
