Dual Knowledge Distillation for Efficient Sound Event Detection

Yang Xiao, Rohan Kumar Das

TL;DR

This work tackles the challenge of running sound event detection (SED) on resource-constrained devices by introducing a dual knowledge distillation framework that combines Temporal-Averaging Knowledge Distillation (TAKD) with Embedding-Enhanced Feature Distillation (EEFD). The SE-CRNN backbone is compacted and augmented with SE and tfwSE modules to maintain performance with fewer parameters. In TAKD, the student learns indirectly from a pre-trained teacher through a mean student, an exponential moving average (EMA) of the student's parameters, while EEFD injects embedding guidance during training to bolster contextual understanding without adding inference cost. On the DCASE 2023 Task 4A public evaluation dataset, the proposed approach achieves higher PSDS1 and PSDS2 than the baseline while using only about one-third of its parameters, demonstrating strong potential for efficient, edge-friendly SED systems. The results suggest that combining mean-teacher-style stability with embedding-driven context improves performance under tight resource constraints.

Abstract

Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals. This is particularly challenging for on-device applications, where computational resources are limited. To address this issue, we introduce a novel framework, referred to as dual knowledge distillation, for developing efficient SED systems. Our proposed dual knowledge distillation commences with temporal-averaging knowledge distillation (TAKD), which uses a mean student model derived from the temporal averaging of the student model's parameters. This allows the student model to learn indirectly from a pre-trained teacher model, ensuring stable knowledge distillation. Subsequently, we introduce embedding-enhanced feature distillation (EEFD), which incorporates an embedding distillation layer within the student model to bolster contextual learning. On the DCASE 2023 Task 4A public evaluation dataset, our proposed SED system with dual knowledge distillation, which has merely one-third of the baseline model's parameters, demonstrates superior performance in terms of PSDS1 and PSDS2. This highlights the importance of the proposed dual knowledge distillation for compact SED systems, which can be ideal for edge devices.
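
The temporal averaging behind the mean student can be illustrated with a small sketch. It is not the authors' implementation: it only shows an exponential moving average (EMA) over parameters, using plain Python lists in place of real model weights. The function name `ema_update`, the dictionary layout, and the `decay` values are assumptions made for illustration; the text above does not state the decay used or how the teacher's loss is applied.

```python
from typing import Dict, List

# Hypothetical stand-in for model weights: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def ema_update(mean_student: Params, student: Params, decay: float = 0.999) -> None:
    """Update the mean student in place as an EMA of the student's parameters.

    The default decay is an illustrative assumption, not a value from the paper.
    """
    for name, values in student.items():
        averaged = mean_student[name]
        for i, value in enumerate(values):
            # Keep most of the running average, blend in a little of the student.
            averaged[i] = decay * averaged[i] + (1.0 - decay) * value


# Toy example: a single "layer" with two weights, held fixed for three steps.
student = {"w": [1.0, -1.0]}
mean_student = {"w": [0.0, 0.0]}

for _ in range(3):
    # In real training the student would change between calls via gradient steps.
    ema_update(mean_student, student, decay=0.5)

print(mean_student["w"])
```

With a decay close to 1, the mean student changes slowly even when the student's parameters fluctuate from step to step, which is the usual motivation for distilling through an averaged model.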

Paper Structure (13 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: SE-CRNN model with proposed EEFD. 'BN' and 'CG' denote batch normalization and context gating, respectively. The shape (x $\times$ y $\times$ z) indicates (channel $\times$ frame $\times$ frequency).
  • Figure 2: SED system using the dual knowledge distillation framework (highlighted by the grey box). $\theta_s$ and $\theta_w$ represent the frame-level and clip-level predictions of the student model, respectively.