AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting

Shreyan Ganguly, Roshan Nayak, Rakshith Rao, Ujan Deb, Prathosh AP

TL;DR

This paper addresses the inefficiency of fixed, uniform loss weighting in knowledge distillation for ASR. It proposes Adaptive Knowledge Distillation (AdaKD), which dynamically weights the distillation loss at the instance level based on teacher-driven sample difficulty, following a curriculum-inspired progression from easy to hard samples. By computing a difficulty factor from the teacher loss and adjusting the distillation weight with $\alpha = e^{-1/\sqrt{d_f}}$, AdaKD emphasizes learning from easier samples first and gradually incorporates harder cases, improving transfer from large teachers to compact students. Experiments on Whisper and wav2vec2 models across multiple languages and datasets show consistent CER improvements over standard KD and other instance-level losses, though gains can be reduced on very large datasets where fine-tuning can rival distillation. The method is practical, plug-and-play, and highlights a path toward more sample-aware distillation in ASR and beyond, with future work aimed at making hyperparameters learnable and combining with additional efficiency techniques.

Abstract

Knowledge distillation, a widely used model compression technique, works by transferring knowledge from a cumbersome teacher model to a lightweight student model. The technique involves jointly optimizing the task-specific and knowledge distillation losses, each with an assigned weight. Although these weights play a crucial role in the performance of the distillation process, current methods give equal weight to both losses, leading to suboptimal performance. In this paper, we propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning that adaptively weights the losses at the instance level. The technique rests on the notion that sample difficulty increases with teacher loss. Our method follows a plug-and-play paradigm and can be applied on top of any task-specific and distillation objectives. Experiments show that our method performs better than the conventional knowledge distillation method and existing instance-level loss functions.

Paper Structure (12 sections, 5 equations, 1 figure, 2 tables, 1 algorithm)

Figures (1)

  • Figure 1: A schematic diagram of the proposed knowledge distillation method with dynamic loss weights based on sample difficulty. The teacher model produces an output $y^{t}$ and incurs a loss $T_{l}$. The difficulty factor $d_{f}$ is calculated based on the teacher loss $T_{l}$ and two hyperparameters ($k$, $t$). The loss weight $\alpha$ for distillation is calculated using the difficulty factor $d_{f}$. The student model produces an output $y^{s}$ and incurs a loss $L_{ts}$. The final loss $L_{st}$ is a combination of the student loss $L_{ts}$ and the distillation loss $L_{kd}$.
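The weighting step described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The formula $\alpha = e^{-1/\sqrt{d_f}}$ is taken from the TL;DR. How $d_f$ is derived from the teacher loss $T_l$ and the hyperparameters ($k$, $t$) is not given on this page, so the sketch takes $d_f$ as an input. The additive combination in `combined_loss` is also an assumption, since the caption only says the final loss is "a combination" of the student and distillation losses. The function names are hypothetical.

```python
import math


def distillation_weight(difficulty: float) -> float:
    """Return alpha = exp(-1 / sqrt(d_f)) for a positive difficulty factor d_f."""
    if difficulty <= 0:
        raise ValueError("difficulty factor d_f must be positive")
    return math.exp(-1.0 / math.sqrt(difficulty))


def combined_loss(student_loss: float, kd_loss: float, difficulty: float) -> float:
    """Combine the per-sample student and distillation losses.

    ASSUMPTION: the additive form L_ts + alpha * L_kd is chosen here for
    illustration only; the source does not state how the losses are combined.
    """
    alpha = distillation_weight(difficulty)
    return student_loss + alpha * kd_loss


if __name__ == "__main__":
    # alpha lies in (0, 1) and grows monotonically with d_f.
    for d_f in (0.25, 1.0, 4.0, 100.0):
        print(f"d_f={d_f:>6}: alpha={distillation_weight(d_f):.4f}")
```

Because the weight is computed per sample, the same function can be applied element-wise over a batch of difficulty factors before the per-sample losses are averaged.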