An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation

Md Arafat Sultan; Aashka Trivedi; Parul Awasthy; Avirup Sil

TL;DR

Knowledge distillation (KD) performance depends on a small set of configuration choices. The authors formalize the KD objective as $\mathcal{L}_{KD} = (1 - \alpha)\, \mathcal{L}_1(q, p^s) + \alpha\, \mathcal{L}_2(o^t_{\tau_t}, o^s_{\tau_s})$ and systematically study four knobs: whether to use human labels ($\alpha$), the teacher–student distance measure (CE vs. MSE), the teacher selection strategy ($t_{hs}$ vs. $t_{ll}$), and student temperature scaling ($\tau_s$, alongside the teacher temperature $\tau_t$). Through a greedy, approximate grid search over 39 classifier–dataset pairs and multiple model sizes, the study shows that a bad parameter choice can cost up to $4.3\%$ in relative performance (the empirical upper bound at $90\%$ probability; $1.6\%$ at $50\%$), and that a single validation-driven configuration (low-loss teacher, CE distance, $\tau_s=1$) performs near the best across many tasks, reducing the risk of catastrophic performance drops. CE-based distillation, a preference for the low-loss teacher, and $\tau_s=1$ emerge as reliable general principles, with diminishing returns expected from more extensive tuning. The results provide practical guidance for KD deployment and motivate further systematic exploration of KD parameter effects across broader task suites.
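To make the objective concrete, here is a minimal sketch of the weighted KD loss for a single classification example, written in plain Python. Several details are assumptions rather than the authors' implementation: $q$ is taken to be a one-hot human label, $\mathcal{L}_1$ is taken to be cross-entropy on untempered student probabilities, $o_{\tau}$ is read as logits divided by $\tau$, CE is applied after a softmax, and MSE is applied directly to the scaled logits. The function names are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of logits divided by temperature tau (numerically stabilized)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy of predicted probabilities against a target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(student_logits, teacher_logits, label, alpha,
            tau_s=1.0, tau_t=1.0, distance="ce"):
    """(1 - alpha) * L1(q, p^s) + alpha * L2(o^t_{tau_t}, o^s_{tau_s})."""
    n = len(student_logits)
    # Assumption: q is a one-hot encoding of the human label.
    q = [1.0 if i == label else 0.0 for i in range(n)]
    # Assumption: the supervised term uses untempered student probabilities.
    supervised = cross_entropy(q, softmax(student_logits))

    if distance == "ce":
        # Assumption: CE compares temperature-softened distributions.
        distill = cross_entropy(softmax(teacher_logits, tau_t),
                                softmax(student_logits, tau_s))
    elif distance == "mse":
        # Assumption: MSE compares temperature-scaled logits directly.
        distill = mse([z / tau_t for z in teacher_logits],
                      [z / tau_s for z in student_logits])
    else:
        raise ValueError(f"unknown distance: {distance!r}")

    return (1.0 - alpha) * supervised + alpha * distill
```

Under these assumptions, `alpha=0` reduces the loss to ordinary supervised training on human labels, while `alpha=1` trains the student purely on the teacher's outputs.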

Abstract

We present a large-scale empirical study of how choices of configuration parameters affect performance in knowledge distillation (KD). An example of such a KD parameter is the measure of distance between the predictions of the teacher and the student, common choices for which include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study on their general effect on student performance. We take an empirical approach to this question in this paper, seeking to find out the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.
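The greedy, approximate grid search mentioned in the TL;DR can be pictured as tuning one parameter at a time while holding the others at their current best values. The sketch below is a generic coordinate-wise search, not the authors' exact procedure: the search order, the candidate values, and the `toy_validation_score` function are all hypothetical placeholders, and the toy scores are invented for illustration rather than taken from the paper.

```python
def greedy_search(space, evaluate, init):
    """Tune one parameter at a time, keeping the rest at their current best."""
    best = dict(init)
    best_score = evaluate(best)
    for name, values in space.items():
        for value in values:
            candidate = dict(best, **{name: value})
            score = evaluate(candidate)
            if score > best_score:  # keep a change only if it strictly helps
                best, best_score = candidate, score
    return best, best_score


# Hypothetical candidate values; the paper's actual grid is not reproduced here.
space = {
    "teacher": ["t_hs", "t_ll"],
    "distance": ["mse", "ce"],
    "tau_s": [1.0, 2.0],
    "alpha": [0.5, 1.0],
}


def toy_validation_score(cfg):
    """Invented stand-in for training a student and scoring it on validation data."""
    score = 0.0
    score += 1.0 if cfg["teacher"] == "t_ll" else 0.0
    score += 1.0 if cfg["distance"] == "ce" else 0.0
    score += 1.0 if cfg["tau_s"] == 1.0 else 0.0
    return score


init = {"teacher": "t_hs", "distance": "mse", "tau_s": 2.0, "alpha": 0.5}
best, best_score = greedy_search(space, toy_validation_score, init)
print(best, best_score)
```

With four binary parameters, this visits 9 configurations instead of the 16 in a full grid. The saving grows with the number of parameters, at the cost of possibly missing interactions between them.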

Paper Structure (21 sections, 2 equations, 6 figures, 6 tables)

Introduction
Preliminaries
Methodology
Experimental Results
Conclusion
Evaluation Metrics
Details of Datasets
Text Classification
Reading Comprehension
Named Entity Recognition
Machine Translation
Introduction
Preliminaries
Methodology
Experimental Results
...and 6 more sections

Figures (6)

Figure 1: Relative performance gain with the best KD configuration in our evaluated sample over two baselines. The empirical upper bound of the cost of making a bad parameter choice is $1.6\%$ with a $50\%$ probability and $4.3\%$ with a $90\%$ probability.
Figure 2: Performance differences due to individual KD parameter choices. Choices of all parameters except $\alpha$ can have a non-negligible impact on performance.
Figure 3: Our proposed configuration (green) optimized over validation data performs at least as well as the original test-specific best configurations in $40\%$ of all test cases, outperforming two baselines across the board.
Figure 4: Relative performance gain with the best KD configuration in our evaluated sample over two baselines. The empirical upper bound of the cost of making a bad parameter choice is $1.6\%$ with a $50\%$ probability and $4.3\%$ with a $90\%$ probability.
Figure 5: Performance differences due to individual KD parameter choices. Choices of all parameters except $\alpha$ can have a non-negligible impact on performance.
...and 1 more figures
