AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting
Shreyan Ganguly, Roshan Nayak, Rakshith Rao, Ujan Deb, Prathosh AP
TL;DR
This paper addresses the inefficiency of fixed, uniform loss weighting in knowledge distillation for ASR. It proposes Adaptive Knowledge Distillation (AdaKD), which dynamically weights the distillation loss at the instance level based on teacher-driven sample difficulty, following a curriculum-inspired progression from easy to hard samples. By computing a difficulty factor $d_f$ from the teacher loss and setting the distillation weight to $\alpha = e^{-1/\sqrt{d_f}}$, AdaKD emphasizes learning from easier samples first and gradually incorporates harder cases, improving transfer from large teachers to compact students. Experiments on Whisper and wav2vec2 models across multiple languages and datasets show consistent CER improvements over standard KD and other instance-level losses, though the gains can shrink on very large datasets, where fine-tuning can rival distillation. The method is practical and plug-and-play, and it points toward more sample-aware distillation in ASR and beyond. Future work aims to make the hyperparameters learnable and to combine the method with additional efficiency techniques.
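The weighting rule quoted above can be illustrated with a minimal Python sketch. Only the formula $\alpha = e^{-1/\sqrt{d_f}}$ comes from the text. The function names `adaptive_weight` and `combined_loss` are hypothetical, the difficulty factor is taken as a given positive number because its exact derivation from the teacher loss is not stated here, and the combination `task_loss + alpha * kd_loss` is an assumption made only for illustration.

```python
import math


def adaptive_weight(difficulty_factor: float) -> float:
    """Return alpha = exp(-1 / sqrt(d_f)) for a positive difficulty factor d_f."""
    if difficulty_factor <= 0:
        raise ValueError("difficulty_factor must be positive")
    return math.exp(-1.0 / math.sqrt(difficulty_factor))


def combined_loss(task_loss: float, kd_loss: float, difficulty_factor: float) -> float:
    """Per-sample objective under an ASSUMED form: task loss plus alpha-weighted KD loss."""
    alpha = adaptive_weight(difficulty_factor)
    return task_loss + alpha * kd_loss


if __name__ == "__main__":
    # Illustrative difficulty factors only; these are not values from the paper.
    for d_f in (0.25, 1.0, 4.0, 100.0):
        print(f"d_f={d_f:>6}: alpha={adaptive_weight(d_f):.4f}")
```

For any positive $d_f$ the weight lies strictly between 0 and 1 and increases monotonically with $d_f$, so the sketch needs no extra clipping or normalization.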
Abstract
Knowledge distillation, a widely used model compression technique, transfers knowledge from a cumbersome teacher model to a lightweight student model. The technique jointly optimizes a task-specific loss and a knowledge distillation loss, each with an assigned weight. Although these weights play a crucial role in the performance of the distillation process, current methods weight both losses equally, leading to suboptimal performance. In this paper, we propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning that adaptively weighs the losses at the instance level. The technique rests on the notion that sample difficulty increases with teacher loss. Our method follows a plug-and-play paradigm and can be applied on top of any task-specific and distillation objectives. Experiments show that our method performs better than the conventional knowledge distillation method and existing instance-level loss functions.
