Flexible Feature Distillation for Large Language Models

Khouloud Saadi, Di Wang

TL;DR

Flex-KD tackles the bottleneck of feature-level KD for LLMs by removing the need for matched hidden sizes. It identifies task-relevant teacher units via gradient-based importance scores and distills only that subspace into the student, using a correlation-based loss and optional logit KD. This enables effective transfer even when the student hidden size is much smaller than the teacher's ($d_S \ll d_T$). Across classification, instruction-following, and summarization, Flex-KD yields consistent gains over linear projection baselines, including robustness in low-data regimes and substantial improvements in summarization (up to 3.75 ROUGE-L points). This parameter-free, task-driven approach broadens practical LLM compression by allowing flexible teacher–student architectures without distortions from projection layers.

Abstract

Knowledge distillation (KD) has become a cornerstone for compressing large language models (LLMs). However, existing LLM-KD methods have primarily focused on logit-based approaches, which achieve good performance but overlook the rich internal representations of LLMs. Feature-level KD could leverage this structure to provide complementary benefits, yet it remains underexplored because current feature-KD approaches typically assume identical teacher-student hidden sizes, a restrictive and unrealistic assumption. A common workaround is to train a linear projector to align their feature spaces; however, this introduces additional parameters, distorts teacher embeddings, and often degrades downstream performance, especially in generative tasks. We propose Flex-KD, a parameter-free framework for task-driven feature distillation for LLMs. Instead of projecting the entire teacher representation, Flex-KD uses gradient-based scores to identify the most task-relevant dimensions of the teacher's hidden states and distills only this subspace into the student. This ensures that the student's limited capacity is allocated to informative components, while avoiding projector-induced distortion and extra parameters. Flex-KD integrates seamlessly with existing KD pipelines and supports differing teacher-student hidden sizes. Extensive experiments across both classification and generative tasks, i.e., instruction-following and summarization, show that Flex-KD consistently boosts student performance, achieving up to a 3.75 percent performance gain over the linear projection baseline.
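
The selection-then-distillation idea described above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the scoring rule (mean of |activation × gradient| per teacher dimension), the use of Pearson correlation, the `1 - correlation` loss form, and all function names are assumptions made for illustration, since the abstract only states that the scores are gradient-based and the TL;DR only states that the loss is correlation-based. Teacher activations and their task-loss gradients are passed in as plain nested lists.

```python
import math

def importance_scores(hidden, grads):
    """Score each teacher dimension by the mean of |activation * gradient|.

    ASSUMPTION: this scoring rule is illustrative; the source only says
    the scores are gradient-based.
    hidden, grads: lists of N rows, each with d_T floats.
    """
    n, d = len(hidden), len(hidden[0])
    return [
        sum(abs(hidden[i][j] * grads[i][j]) for i in range(n)) / n
        for j in range(d)
    ]

def select_top_units(scores, k):
    """Return the indices of the k highest-scoring dimensions, sorted ascending."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0

def correlation_loss(teacher_sub, student):
    """Mean over matched dimensions of (1 - Pearson correlation across samples).

    ASSUMPTION: one plausible correlation-based loss, not necessarily the paper's.
    teacher_sub, student: lists of N rows, each with d_S floats.
    """
    d = len(student[0])
    total = 0.0
    for j in range(d):
        t_col = [row[j] for row in teacher_sub]
        s_col = [row[j] for row in student]
        total += 1.0 - pearson(t_col, s_col)
    return total / d

# Toy example: d_T = 4 teacher dimensions, d_S = 2 student dimensions, N = 3 samples.
teacher_hidden = [[1.0, 0.1, 2.0, 0.5],
                  [2.0, 0.2, 1.0, 0.5],
                  [3.0, 0.1, 3.0, 0.5]]
teacher_grads = [[0.5, 0.0, 0.1, 0.0],
                 [0.5, 0.1, 0.1, 0.0],
                 [0.5, 0.0, 0.1, 0.0]]

scores = importance_scores(teacher_hidden, teacher_grads)
selected = select_top_units(scores, k=2)
teacher_sub = [[row[j] for j in selected] for row in teacher_hidden]

# A student whose features are a scaled copy of the selected teacher units.
student_hidden = [[2.0 * v for v in row] for row in teacher_sub]
loss = correlation_loss(teacher_sub, student_hidden)
```

Because the selection step only indexes into the teacher's hidden state, no projection matrix is trained, which matches the parameter-free claim. A correlation-style loss is also insensitive to per-dimension scale and shift, so the student in the toy example incurs no penalty for being a rescaled copy of the selected teacher units.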

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Last-layer activation magnitudes (z-axis) of a fine-tuned GPT-xlarge on a downstream example, with values $<2$ set to zero. The x/y axes denote sequence and features.
  • Figure 2: Overview of Flex-KD.
  • Figure 3: Student model performance on the IMDB dataset as a function of $\alpha$.
  • Figure 4: Activation magnitudes (z-axis) after feeding training samples from the downstream task to a fine-tuned GPT-xlarge. The x and y axes are the sequence and feature dimensions, respectively. Values below a threshold are set to zero: (a) threshold $1$; (b) threshold $0.5$; (c) threshold $2$.
  • Figure 5: Overlap of selected units across 5 random seeds on (left) the XSum dataset and (right) the CNN/DailyMail dataset.
  • ...and 2 more figures