Multi-Granularity Semantic Revision for Large Language Model Distillation

Xiaoyu Liu; Yun Zhang; Wei Li; Simiao Li; Xudong Huang; Hanting Chen; Yehui Tang; Jie Hu; Zhiwei Xiong; Yunhe Wang

TL;DR

This work tackles knowledge distillation for autoregressive LLMs, targeting two problems in student-driven distillation: generation errors in student-sampled sequences and distillation signals that fail to align the most informative semantics. It introduces Multi-Granularity Semantic Revision (MGSR), a three-level framework consisting of sequence-level SCRG, token-level DAC-KL, and span-level correlation consistency, with an overall objective $L_{\text{overall}} = L_{\text{SFT}} + L_{\text{DAC-KLD}} + L_{\text{span}}$. The DAC-KL loss uses a learnable sub-network to adaptively clip the teacher's output distribution down to its semantically dense regions. SCRG identifies error tokens, replaces them with teacher-generated ones, and re-generates the sequence to improve sample quality and diversity. Span priors enforce cross-token semantic consistency. Experiments across model families from $0.1B$ to $13B$ parameters and five instruction-following datasets show that MGSR outperforms state-of-the-art KD methods, enabling smaller students to closely approach or surpass teacher performance and making efficient LLMs more practical to deploy.

Abstract

Knowledge distillation plays a key role in compressing Large Language Models (LLMs): it improves a small student model under the guidance of a large teacher model. However, existing LLM distillation methods rely too heavily on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, the distillation loss functions introduced in prior work struggle to align the most informative part of the teacher's output because of the complex distribution of LLMs' outputs. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation (SCRG) strategy. SCRG first calculates the semantic cognitive difference between the teacher and student to detect the error token, then corrects it with the teacher-generated one, and re-generates the sequence to reduce generation errors and enhance generation diversity. At the token level, we design a distribution adaptive clipping Kullback-Leibler (DAC-KL) loss as the distillation objective function. The DAC-KL loss exploits a learnable sub-network to adaptively extract semantically dense areas from the teacher's output, avoiding the interference of redundant information in the distillation process. Finally, at the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher's and student's probability correlations to be consistent, further enhancing the transfer of semantic information. Extensive experiments across different model families with parameters ranging from 0.1B to 13B demonstrate the superiority of our method compared to existing methods.
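The SCRG steps described in the abstract (detect the error token, correct it with the teacher's token, re-generate) can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration and not taken from the paper: the teacher-student difference is measured with a KL divergence, the error token is taken to be the single position with the largest difference, decoding is greedy, and `scrg`, `toy_student`, and `toy_teacher` are hypothetical names.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def scrg(prompt, student_probs, teacher_probs, max_new_tokens):
    """Toy sequence correction and re-generation.

    student_probs / teacher_probs are hypothetical callables that map a token
    prefix (list of ints) to a next-token distribution (list of floats).
    """
    def generate(prefix, n):
        seq = list(prefix)
        for _ in range(n):
            seq.append(argmax(student_probs(seq)))  # greedy decoding (assumption)
        return seq

    # 1) The student drafts a full sequence.
    draft = generate(prompt, max_new_tokens)
    if len(draft) == len(prompt):
        return draft, None

    # 2) Score every generated position by teacher-student disagreement
    #    (KL divergence is an assumed stand-in for the paper's measure).
    gaps = [
        kl_divergence(teacher_probs(draft[:t]), student_probs(draft[:t]))
        for t in range(len(prompt), len(draft))
    ]
    error_pos = len(prompt) + argmax(gaps)

    # 3) Replace the error token with the teacher's choice at that position.
    corrected = draft[:error_pos] + [argmax(teacher_probs(draft[:error_pos]))]

    # 4) Let the student re-generate the rest from the corrected prefix.
    return generate(corrected, len(draft) - len(corrected)), error_pos


# Toy models over a 3-token vocabulary: they disagree only after 2 tokens.
def toy_student(prefix):
    return [0.1, 0.8, 0.1] if len(prefix) == 2 else [0.8, 0.1, 0.1]


def toy_teacher(prefix):
    return [0.1, 0.1, 0.8] if len(prefix) == 2 else [0.8, 0.1, 0.1]


revised, error_pos = scrg([0], toy_student, toy_teacher, max_new_tokens=3)
print(revised, error_pos)  # [0, 0, 2, 0] 2
```

In this toy run the student's draft is `[0, 0, 1, 0]`; the only disagreement is at position 2, so that token is swapped for the teacher's choice and the tail is re-generated by the student. With sampling instead of greedy decoding, the re-generated tail would generally differ from the draft, which is one way the re-generation step could add diversity.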

Paper Structure (21 sections, 11 equations, 5 figures, 8 tables)

Introduction
Related work
KD for encoder-only language models.
KD for auto-regressive large language models.
Preliminary
Multi-Granularity Semantic Revision
Sequence-level correction and re-generation
Token-level DAC-KL loss function
Span-level correlation consistency
Overall Optimization
Experiments
Experimental description
Comparison with state-of-the-art KD methods
Ablation analysis
Conclusion and Limitation
...and 6 more sections

Figures (5)

Figure 1: Knowledge Distillation using Different Sampled Datasets. (a) Traditional KD using a fixed dataset (hinton2015distilling). (b) KD using the student-generated dataset, which can be categorized into on-policy based methods (agarwal2024policy, gu2023minillm) and the off-policy based method (ko2024distillm). (c) Our proposed KD approach, which leverages a sequence correction and re-generation strategy and can be seamlessly integrated with both on-policy and off-policy generation schedules.
Figure 2: The workflow of sequence correction and re-generation strategy.
Figure 3: The workflow of the DAC-KL loss function.
Figure 4: The workflow of the span-level correlation distillation. $\circ$ denotes Hadamard multiplication.
Figure 5: Examples of the probability distribution of the teacher's output, depicted using kernel density estimation. The original distribution is represented by the blue line, while the distribution of the adaptively clipped probability classes is shown by the red line. The figure shows that the DAC-KL loss restricts distillation to the regions of the probability distribution with dense semantic knowledge. Enforcing the student model to imitate the distribution of these regions can effectively mitigate the training interference caused by low-semantic regions for student models with limited learning capacity.
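
The clipping idea described in the Figure 5 caption can be sketched as a KL divergence computed only over a band of the teacher's sorted probability classes. This is a rough illustration under stated assumptions, not the paper's formulation: the band is specified by two quantiles `lower_q` and `upper_q` passed in as plain numbers (standing in for whatever the learnable sub-network outputs), both distributions are renormalized over the kept classes, and `clipped_kl` is a hypothetical name.

```python
import math


def clipped_kl(teacher_probs, student_probs, lower_q, upper_q, eps=1e-12):
    """KL(teacher || student) restricted to a quantile band of teacher classes.

    lower_q and upper_q are quantiles in [0, 1] over the classes sorted by
    teacher probability (ascending). In this sketch they are fixed inputs;
    treating them as outputs of a learnable sub-network is left out.
    """
    n = len(teacher_probs)
    order = sorted(range(n), key=lambda i: teacher_probs[i])  # ascending
    lo = int(lower_q * (n - 1))
    hi = int(upper_q * (n - 1))
    keep = order[lo:hi + 1]  # classes inside the clipped band

    # Renormalize both distributions over the kept classes (assumption).
    t_mass = sum(teacher_probs[i] for i in keep)
    s_mass = sum(student_probs[i] for i in keep)

    loss = 0.0
    for i in keep:
        t = teacher_probs[i] / (t_mass + eps)
        s = student_probs[i] / (s_mass + eps)
        if t > 0:
            loss += t * math.log((t + eps) / (s + eps))
    return loss


teacher = [0.70, 0.20, 0.05, 0.03, 0.02]
# This student matches the teacher on the three most likely classes and
# differs only in the low-probability tail.
student = [0.70, 0.20, 0.05, 0.01, 0.04]

full = clipped_kl(teacher, student, lower_q=0.0, upper_q=1.0)    # all classes
dense = clipped_kl(teacher, student, lower_q=0.5, upper_q=1.0)   # top three only
print(full > dense)  # True
```

With the band set to the upper half of the sorted classes, the mismatch in the low-probability tail no longer contributes to the loss, which is the kind of interference the caption says the clipping is meant to avoid.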
