Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much)

Zony Yu, Yuqiao Wen, Lili Mou

TL;DR

The paper investigates whether the layer-selection strategy in intermediate-layer knowledge distillation significantly affects student performance. In controlled experiments with BERT, BART, T5, and Qwen3 on classification and generation tasks, the authors compare forward, reverse, all-to-one, and random matching under both random and teacher-weight initialization. All strategies yield similar results, and intermediate-layer matching remains beneficial compared with no matching. A geometric interpretation explains the insensitivity to layer order: viewed from the student's perspective, the angles between teacher layers are acute. Forward matching is recommended as the default when resources are limited.

Abstract

Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for intermediate-layer matching in KD, where a student layer is forced to resemble a certain teacher layer. In this work, we revisit such layer-selection strategies and observe an intriguing phenomenon that layer-selection strategy does not matter (much) in intermediate-layer matching -- even seemingly nonsensical matching strategies such as reverse matching still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective. Our work sheds light on KD practice, as layer-selection strategies may not be the main focus of KD system design, and vanilla forward matching works well in most setups.
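To make the compared strategies concrete, here is a minimal sketch of how each one could assign a teacher layer to every student layer. The function name `layer_mapping`, the evenly spaced forward schedule, the choice of the last teacher layer for all-to-one, and the independent sampling for random matching are illustrative assumptions, not the paper's exact definitions.

```python
import random


def layer_mapping(strategy, n_student, n_teacher, seed=0):
    """Return a list m where student layer i is matched to teacher layer m[i].

    Layers are 0-indexed. All schedules below are assumptions made for
    illustration; the paper may define the strategies differently.
    """
    if not 0 < n_student <= n_teacher:
        raise ValueError("expected 0 < n_student <= n_teacher")

    # Assumed forward schedule: evenly spaced teacher layers, in order.
    step = n_teacher // n_student
    forward = [(i + 1) * step - 1 for i in range(n_student)]

    if strategy == "forward":
        return forward
    if strategy == "reverse":
        # Same teacher layers, assigned in the opposite order.
        return forward[::-1]
    if strategy == "all_to_one":
        # Assumption: every student layer targets the last teacher layer.
        return [n_teacher - 1] * n_student
    if strategy == "random":
        # Assumption: each student layer draws a teacher layer independently.
        rng = random.Random(seed)
        return [rng.randrange(n_teacher) for _ in range(n_student)]
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions, a 3-layer student distilled from a 12-layer teacher uses teacher layers 3, 7, and 11 for forward matching, and the same layers in the order 11, 7, 3 for reverse matching.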

Paper Structure

This paper contains 10 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: (a) Illustration of the angle calculation. Cosine similarities are shown for (b) MNLI classification, (c) the encoder in the WMT task, and (d) the decoder in the WMT task. Orange denotes random parameter initialization; blue denotes student weights initialized from the teacher.
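The caption does not spell out the angle calculation, so the following is only one plausible reading: treat the student representation as the origin, form the vectors pointing from it to two teacher-layer representations, and take the cosine of the angle between them. The function name `cosine_from_student` and the use of plain Python lists as representations are assumptions for illustration.

```python
import math


def cosine_from_student(student, teacher_a, teacher_b):
    """Cosine of the angle at `student` between two teacher representations.

    Assumed reading of the figure: the angle is measured between the
    vectors (teacher_a - student) and (teacher_b - student).
    """
    u = [a - s for a, s in zip(teacher_a, student)]
    v = [b - s for b, s in zip(teacher_b, student)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a teacher representation coincides with the student")
    return dot / (norm_u * norm_v)
```

A positive cosine corresponds to an acute angle. In that case, moving the student toward one teacher layer also reduces, to first order, its distance to the other, which is consistent with the reported insensitivity to which layer is matched.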