Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

TL;DR

This survey analyzes how knowledge distillation transfers representations from large teacher models to smaller student models, with emphasis on feature alignment and representation transfer, including in large language model settings. It catalogs variants such as PWKD, CAKD, LAD, CAM-based distillation, and attention-based KD, and discusses efficient training, semiparametric foundations, and reproducibility concerns. The review highlights techniques that improve transfer, such as block-wise logit distillation and frequency attention, and identifies open challenges in scalability, data efficiency, and standardized evaluation. It concludes with future directions for robust, transferable KD methods applicable to computer vision and natural language processing.

Abstract

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has driven substantial advances in applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation, to give researchers and practitioners insight into its evolving role in artificial intelligence and machine learning.

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: An overview of the Knowledge Distillation (KD) framework, where a large, high-performance teacher model transfers knowledge to a lightweight student model. Soft labels derived from the teacher’s softened logits guide the student’s learning alongside hard labels.
  • Figure 2: Foundations of Knowledge Distillation. The teacher model uses the temperature parameter $T$ to produce softened predictions (soft targets) that capture nuanced information and guide the student model. The student is trained with a loss combining KL divergence between its outputs and the teacher’s soft targets, and cross-entropy with the true labels.
  • Figure 3: Semiparametric framework for Knowledge Distillation. The diagram illustrates how a complex teacher model transfers knowledge to a compact student model through distillation loss, with theoretical guarantees provided by KL-divergence decomposition and semiparametric risk bounds.
  • Figure 4: Timeline of major developments in Knowledge Distillation (KD).
  • Figure 5: Illustration of logit-based knowledge distillation with decoupled knowledge paths. Techniques shown include block-wise distillation, perception calibration, and temperature-based normalization.
  • ...and 2 more figures
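
The loss described in the Figure 2 caption (KL divergence against the teacher's temperature-softened targets, combined with cross-entropy on the true labels) can be sketched as below. This is a minimal illustration, not the paper's implementation: the weighting parameter `alpha`, the function names, and the `temperature**2` scaling of the KL term (a common convention in the KD literature) are assumptions not stated on this page.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    `alpha` and the T^2 factor are illustrative assumptions, not values from the survey.
    """
    p_teacher = softmax(teacher_logits, temperature)  # soft targets
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Cross-entropy with the hard label, computed at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    matched = kd_loss([1.0, 2.0, 3.0], teacher, label=2)
    mismatched = kd_loss([3.0, 2.0, 1.0], teacher, label=2)
    print(f"student matches teacher:   {matched:.4f}")
    print(f"student disagrees:         {mismatched:.4f}")
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains, so the loss is lower than for a student that ranks the classes in the opposite order.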