Flexible Feature Distillation for Large Language Models

Khouloud Saadi; Di Wang

TL;DR

Flex-KD addresses a key bottleneck of feature-level KD for LLMs: the requirement that teacher and student share the same hidden size. It scores teacher hidden units with gradient-based importance measures, keeps the most task-relevant ones, and distills only that subspace into the student using a correlation-based loss, optionally combined with logit KD. This enables effective transfer even when the student hidden size is much smaller than the teacher's ($d_S \ll d_T$). Across classification, instruction-following, and summarization, Flex-KD yields consistent gains over linear projection baselines, remains robust in low-data regimes, and improves summarization by up to 3.75 ROUGE-L points. Because it is parameter-free and task-driven, the approach allows flexible teacher-student architecture pairings without the distortion introduced by projection layers.

Abstract

Knowledge distillation (KD) has become a cornerstone for compressing large language models (LLMs). However, existing LLM-KD methods have primarily focused on logit-based approaches, which achieve good performance but overlook the rich internal representations of LLMs. Feature-level KD could leverage this structure to provide complementary benefits, yet it remains underexplored because current feature-KD approaches typically assume identical teacher-student hidden sizes, a restrictive and unrealistic assumption. A common workaround is to train a linear projector to align their feature spaces; however, this introduces additional parameters, distorts teacher embeddings, and often degrades downstream performance, especially in generative tasks. We propose Flex-KD, a parameter-free framework for task-driven feature distillation for LLMs. Instead of projecting the entire teacher representation, Flex-KD uses gradient-based scores to identify the most task-relevant dimensions of the teacher's hidden states and distills only this subspace into the student. This ensures that the student's limited capacity is allocated to informative components, while avoiding projector-induced distortion and extra parameters. Flex-KD integrates seamlessly with existing KD pipelines and supports differing teacher-student hidden sizes. Extensive experiments across both classification and generative tasks, i.e., instruction-following and summarization, show that Flex-KD consistently boosts student performance, achieving up to a 3.75 percent performance gain over the linear projection baseline.
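The selection-then-distillation idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: a linear task head `W_head` on top of the teacher's hidden state, an importance score defined as the mean absolute activation-times-gradient of the cross-entropy loss, and a feature loss defined as one minus the mean per-dimension Pearson correlation over a batch. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def importance_scores(h_teacher, W_head, labels):
    """Score each teacher hidden unit by mean |activation * gradient|.

    Assumption: the task loss is cross-entropy over logits h_teacher @ W_head.
    h_teacher: (n, d_T), W_head: (d_T, C), labels: (n,) integer class ids.
    """
    probs = softmax(h_teacher @ W_head)
    onehot = np.eye(W_head.shape[1])[labels]
    # Analytic gradient of the per-sample cross-entropy w.r.t. the hidden state.
    grad_h = (probs - onehot) @ W_head.T          # shape (n, d_T)
    return np.abs(h_teacher * grad_h).mean(axis=0)

def select_units(scores, d_student):
    """Return the indices of the d_student highest-scoring units, ascending."""
    return np.sort(np.argsort(scores)[-d_student:])

def correlation_loss(h_student, h_teacher_sel, eps=1e-8):
    """One minus the mean per-dimension Pearson correlation across the batch.

    h_student and h_teacher_sel both have shape (n, d_S); no projector is used.
    """
    s = h_student - h_student.mean(axis=0)
    t = h_teacher_sel - h_teacher_sel.mean(axis=0)
    denom = np.linalg.norm(s, axis=0) * np.linalg.norm(t, axis=0) + eps
    corr = (s * t).sum(axis=0) / denom
    return 1.0 - corr.mean()
```

In a training loop, `select_units` would be run once on the teacher to fix the index set, and `correlation_loss(h_student, h_teacher[:, idx])` would then be added to the usual task or logit-KD objective. Since the selected teacher slice already has width $d_S$, no trainable projection is needed.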
