An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation
Md Arafat Sultan, Aashka Trivedi, Parul Awasthy, Avirup Sil
TL;DR
Knowledge distillation (KD) performance depends strongly on a small set of configuration choices. The authors formalize the KD objective as $\mathcal{L}_{KD} = (1 - \alpha) \mathcal{L}_1(q,p^s) + \alpha \mathcal{L}_2(o^t_{\tau_t}, o^s_{\tau_s})$ and systematically study four parameters: whether to use human labels (controlled by $\alpha$), the teacher–student distance measure $\mathcal{L}_2$ (CE vs MSE), the teacher selection strategy ($t_{hs}$ vs $t_{ll}$), and student temperature scaling ($\tau_s$, alongside the teacher temperature $\tau_t$). Using a greedy, approximate grid search over 39 classifier–dataset pairs and multiple model sizes, the study shows that good parameter choices yield improvements of up to $4.3\%$ in the worst cases, and that a single validation-driven configuration (low-loss teacher, CE distance, $\tau_s=1$) performs near the best across many tasks, reducing the risk of catastrophic performance drops. CE-based distillation, a preference for a low-loss teacher, and $\tau_s=1$ emerge as reliable general principles, with only sublinear (diminishing) gains expected from more extensive tuning. The results provide practical guidance for KD deployment and motivate further systematic exploration of KD parameter effects across broader task suites.
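To make the objective above concrete, the following is a minimal sketch of the loss for a single example, written in plain Python. It is not the authors' implementation. Several details are assumptions made only for illustration: $q$ is treated as a one-hot label distribution, $p^s$ as the student's softmax at temperature 1, the CE variant compares temperature-scaled softmax outputs, and the MSE variant compares temperature-scaled logits. The function and argument names (`kd_loss`, `distance`, `tau_t`, `tau_s`) are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; larger tau gives a flatter distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy of `pred` with respect to the distribution `target`."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kd_loss(label_dist, student_logits, teacher_logits,
            alpha=0.5, distance="ce", tau_t=1.0, tau_s=1.0):
    """(1 - alpha) * L1(q, p^s) + alpha * L2(o^t_{tau_t}, o^s_{tau_s}).

    Assumptions (not taken from the paper): L1 is cross-entropy against the
    labels; "ce" compares softened probabilities; "mse" compares scaled logits.
    """
    # Supervised term: student probabilities vs. human labels.
    supervised = cross_entropy(label_dist, softmax(student_logits))

    # Distillation term: teacher outputs vs. student outputs.
    if distance == "ce":
        distill = cross_entropy(softmax(teacher_logits, tau_t),
                                softmax(student_logits, tau_s))
    elif distance == "mse":
        distill = mse([z / tau_t for z in teacher_logits],
                      [z / tau_s for z in student_logits])
    else:
        raise ValueError("distance must be 'ce' or 'mse'")

    return (1 - alpha) * supervised + alpha * distill

if __name__ == "__main__":
    q = [1.0, 0.0, 0.0]        # one-hot human label
    student = [1.2, 0.3, -0.5]  # illustrative logits
    teacher = [2.5, 0.1, -1.0]
    print(kd_loss(q, student, teacher, alpha=0.5, distance="ce"))
    print(kd_loss(q, student, teacher, alpha=1.0, distance="mse"))
```

Under these assumptions, setting `alpha=1.0` drops the human-label term entirely, and `alpha=0.0` reduces the objective to ordinary supervised training, which is how the "use human labels" choice maps onto $\alpha$.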
Abstract
We present a large-scale empirical study of how choices of configuration parameters affect performance in knowledge distillation (KD). An example of such a KD parameter is the measure of distance between the predictions of the teacher and the student, common choices for which include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study on their general effect on student performance. We take an empirical approach to this question in this paper, seeking to find out the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.
